Say you have an array of boolean values. Each element of the array needs to be merely true or false. I'd normally implement the array this way.
my @bools; $bools[0] = 1; # element 0 is true $bools[1] = ''; # element 1 is false
To get a list (or count) of the indexes with true values, just grep.
my @trues = grep $bools[$_], 0 .. $#bools;
There's not much to do with such an array besides enumerate the true/false elements and check whether any one element is true or false.
Scaling up
But what if you have a very large array? Devel::Size tells me that the simple @bools above (with one true and one false) amounts to 116 bytes. Adding ten more elements (half true, half false), adds 252 bytes. An array of 10_000_000 elements is 287_108_904 bytes—to store data that's all single bits.
You can store the same data in a scalar as bits using vec.
my $bools; vec( $bools, 0, 1 ) = 1; # element 0 is true vec( $bools, 1, 1 ) = 0; # element 1 is false
The storage space required is much more modest—just 1_250_028 bytes for 10_000_000 elements.
You can still grep for true indexes:
my @trues = grep vec( $bools, $_, 1 ), 0 .. 8 * length $bools;
Unfortunately, that will still build a huge list to scan over, so a for loop is more memory efficient.
foreach my $i ( 0 .. 8 * length $bools ) { push @trues, $i if vec( $bools, $i, 1 ); }
If all you want is a count of true elements, a regular expression can scan over the field faster than a loop.
# for each character, count the on bits in that character my %bits_in; foreach my $char ( map { chr } 0 .. 255 ) { foreach my $bit ( 0 .. 7 ) { $bits_in{ $char } += vec( $char, $bit, 1 ); } } my $count; $count += $bits_in{ $1 } while ( $bools =~ m{([^\000])}g );
I also tested a destructive algorithm using chop, and it's usually faster than regular expressions. The regexp is still faster on a sparse array (with >90% false values).
my $count; $count += $bits_in{ chop $bits } while $bits;
Of course, destruction of the structure has its own problems. Either you lose the data structure, or you have to copy it. That could be a problem if memory is getting tight.
Not quite as fast as chopping is to use substr and walk down the bits eight at a time.
my $count; $count += $bits_in{ substr $bits, $_, 1 } for 0 .. length( $bits ) - 1 +;
Other implementations
I wonder if there's any interest in making this into a module that can be tied to an array. I see Tie::Array::Packed and Tie::Array::Pack on CPAN, apparently based on pack and unpack, but I don't see mention of single bit values being supported by them. There's Bit::Vector which operates on a vector like this, but it doesn't tie. I'm surprised not to find a tied array implementation based on single bits, so I'm wondering if I'm missing something.
I'd also be interested to hear other monks' implementations of enumerating or counting elements in the array. (If I were writing a tied array implementation, I'd just keep a count as a separate value, but that's not feasible for the application that lead me to this data structure. For me, it's useful to do bitwise operations on the fields I'm maintaining. I have an expression, "$x | $y ^ $y", which is every element in $x but not in $y. After that, I want to list each index.)
Thanks
Thanks to jeffa, tye, and ysth for their comments on an earlier draft of this meditation.
Performance testing
Here's the Benchmark I used for comparing performance:
use Benchmark qw( cmpthese ); my %bits_in; foreach my $char ( map { chr } 0 .. 255 ) { foreach my $bit ( 0 .. 7 ) { $bits_in{ $char } += vec( $char, $bit, 1 ); } } my @tests = ( [ 8, 1 ], [ 8, 8 ], [ 1_000, 1 ], [ 1_000, 10 ], [ 1_000, 100 ], [ 1_000, 500 ], [ 1_000, 900 ], [ 1_000_000, 1 ], [ 1_000_000, 100 ], [ 1_000_000, 100_000 ], [ 1_000_000, 500_000 ], [ 1_000_000, 900_000 ], ); #use Test::More; #plan 'tests' => 3 * scalar @tests; foreach my $t ( @tests ) { my $bits = mkbits( @{ $t } ); printf "%d, %d\n", @{ $t }; # my $v = vec_count( $bits ); # is( $v, re_count( $bits ), "re counted $v" ); # is( $v, chop_count( $bits ), "chop counted $v" ); # is( $v, loop_count( $bits ), "loop counted $v" ); cmpthese( -1, { 'vec' => sub { vec_count( $bits ) }, 're' => sub { re_count( $bits ) }, 'chop' => sub { chop_count( $bits ) }, 'loop' => sub { loop_count( $bits ) }, } ); } sub chop_count { my ( $bits ) = @_; my $count; $count += $bits_in{ chop $bits } while $bits; return $count; } sub vec_count { my ( $bits ) = @_; my $bit_count = length( $bits ) * 8; my $count; $count += vec( $bits, $_, 1 ) for 0 .. $bit_count; return $count; } sub re_count { my ( $bits ) = @_; my $count; $count += $bits_in{ $1 } while ( $bits =~ m{([^\000])}g ); return $count; } sub loop_count { my ( $bits ) = @_; my $count; $count += $bits_in{ substr $bits, $_, 1 } for 0 .. length( $bits ) + - 1; return $count; } sub mkbits { my ( $bitcount, $ones ) = @_; my $bits = "\000" x int( $bitcount / 8 ); while ( $ones-- ) { vec( $bits, rand $bitcount, 1 ) = 1; } return $bits; }
|
|---|