dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

I need to count how many bits are set in a binary file. I considered processing the strings from an "od -x" output, since that would remove the many consecutive zero bytes I expect, but I also took a whack with this:
perl -e '$bytes= -s "dwtest.dat"; open(FIN,"dwtest.dat") or die "cant open file\n"; $sum=0; for($i=0;$i<$bytes;$i++){ read(FIN,$c,1); (($c | 0x01) && ($sum++)); (($c | 0x02) && ($sum++)); (($c | 0x04) && ($sum++)); (($c | 0x08) && ($sum++)); (($c | 0x10) && ($sum++)); (($c | 0x20) && ($sum++)); (($c | 0x40) && ($sum++)); (($c | 0x80) && ($sum++)); } print "$sum bits are set\n";'
It works, but it's slow.

Does anyone have a faster solution?

update
pg is right, I'm a bonehead, a bitwise and on the ord is needed. I'm dealing with either 512MB or 4GB files. When I'm working with 512M, Randal's code flies, but because of the slurp, I think I'll have to go to jmcnamara's when I deal with 4G. Now I'm looking at minutes, rather than hours - thanks!

Replies are listed 'Best First'.
Re: counting set bits in binary file
by jmcnamara (Monsignor) on Jul 16, 2005 at 02:29 UTC

    Here is one way with a more efficient count using the unpack checksum:
    #!/usr/bin/perl -wl use strict; my $count = 0; my $long; open FILE, "file.txt" or die "Couldn't open file: $!\n"; binmode FILE; while (read FILE, $long, 4) { $count += unpack'%32b*', $long; } print $count; __END__

    Or make it more efficient by memoising the calculated values:

    #!/usr/bin/perl -wl use strict; my $count = 0; my %lookup; my $byte; open FILE, "file.txt" or die "Couldn't open file: $!\n"; binmode FILE; while (read FILE, $byte, 1) { next unless ord $byte; if ($lookup{$byte}) { $count += $lookup{$byte}; } else { $count += $lookup{$byte} = unpack'%8b*', $byte; } } print $count; __END__
    But a pure lookup table (as suggested above and previously) would be fastest of all.

    --
    John.

      That doesn't work:

      >perl -le "print(unpack('%32b*', 0x13))" 7

      There are 3 bits set in 0x13, not 7.

      Sorry, my mistake.

      >perl -le "print(unpack('%32b*', chr(0x13)))" 3
Re: counting set bits in binary file
by hv (Prior) on Jul 16, 2005 at 01:42 UTC
Re: counting set bits in binary file
by merlyn (Sage) on Jul 16, 2005 at 04:25 UTC
      The slurp gets problematic for very large files, of course, so here is a version that uses a fixed buffer. It runs at essentially the same speed - but uses a constant amount of memory instead.
      #!/opt/bin/perl use strict; use warnings; use Time::HiRes qw (gettimeofday tv_interval); my $file = shift; my $start_time = [gettimeofday]; open(FH,$file) or die("Can't open file: $!\n"); binmode FH; my $bytes = -s $file; my $cursor = 0; my $stride = 4000000; my $counted_bits = 0; while (1) { my $buffer; my $result = read(FH, $buffer, $stride); if (0 == $result) { last; } elsif (not defined $result) { warn("Error reading from $file: $!\n"); last; } my $bits = unpack "B*", $buffer; $counted_bits += $bits =~ tr/1/1/; } close (FH); print "Bits: $counted_bits\n"; my $elapsed = tv_interval($start_time); my $rate = $bytes / $elapsed; print "Bytes per second: $rate\n";
Re: counting set bits in binary file
by pg (Canon) on Jul 16, 2005 at 02:48 UTC

    Besides the performance issue you mentioned, I don't even think that your code works.

    You need to fix couple of things:

    1. you need & not |
    2. you need to do an ord()

    Test this:

    my $c = chr(0xff); my $sum = 0; if ($c) { (ord($c) & 0x01) && ($sum++); (ord($c) & 0x02) && ($sum++); (ord($c) & 0x04) && ($sum++); (ord($c) & 0x08) && ($sum++); (ord($c) & 0x10) && ($sum++); (ord($c) & 0x20) && ($sum++); (ord($c) & 0x40) && ($sum++); (ord($c) & 0x80) && ($sum++); } print $sum;

    Your code would look like this (there is no need to count the size of the file, I leave it to you to fix the performance)

    use strict; use warnings; open(FIN,"perl.exe") or die "cant open file\n"; my $sum = 0; my $c; while (read(FIN, $c, 1)) { if ($c) { (ord($c) & 0x01) && ($sum++); (ord($c) & 0x02) && ($sum++); (ord($c) & 0x04) && ($sum++); (ord($c) & 0x08) && ($sum++); (ord($c) & 0x10) && ($sum++); (ord($c) & 0x20) && ($sum++); (ord($c) & 0x40) && ($sum++); (ord($c) & 0x80) && ($sum++); } } print "$sum bits are set\n";
Re: counting set bits in binary file
by blazar (Canon) on Jul 18, 2005 at 08:57 UTC
    A somewhat related subject had been discussed throughly at clpmisc in this thread: count of 1s in a binary number (this is a Google groups link). IMHO it is worth reading in any case. Of course you have the added problem of reading in from a file, but in that case I guess that you may read in chunks and sum up...