Me again with a compression problem (I say again because i already had two posts on it but i am sooo far out of my league here that i will try to ask the monks for another brainstorming session here).
This time the issue is that i have a long list of bit-vectors that are taking up too much space. When i look at them at byte scale i see not redundancy but when i inspect the bit level there is roughly space for ~8x compression to be exploited. To illustrate the problem i've written the model down :
when i try to compress the generated (bit-)string dictionary i get :use strict; srand(2); # create a random string of 1's and 0's (string length = $l) sub make_rand_str{ my ($l) = @_; if ($l%8!=0){ print STDERR "not a divider of a byte 8-bits)\n"; die; } my $res = ''; for (1..$l){ $res .= int(rand(2)); } return $res; } # take a string and randomly flip 1's/0's for <= $n number of positio +ns sub flip { my ($n,$str) = @_; my $res = ''; my @str = split("",$str); for (0..($n-1)){ my $i = int(rand(length($str)-1)); $str[$i] = ($str[$i] eq '1' ? '0': '1' ); } return join("", @str); } # pack the string sub pack_str { my ($s) = @_; my $res = ''; my @str = split("",$s); for(my $i=0; $i<@str; $i+=8){ my $st = join("",@str[$i..($i+7)]); $res .= pack('B8',$st); } return $res; } #------------------------Main---------------------# # though the string length here is set to 80 in reality it is anywhere + between 300-1000 my $str_len = 80; my $num_of_flips = int($str_len * 0.25); my $randstr = &make_rand_str($str_len); print $randstr . "\n"; open(UNPACKED,">", "unpacked.str") || die $!; open(PACKED,">:raw", "packed.str") || die $!; for (1..1000){ my $mut_str = &flip($num_of_flips,$randstr); print UNPACKED "$mut_str\n"; my $packed_str = &pack_str($mut_str); print PACKED "$packed_str"; } close UNPACKED; close PACKED; system("cat unpacked.str | gzip -9 -c - > unpacked.str.gz"); system("cat packed.str | gzip -9 -c - > packed.str.gz");
this demonstrates the existence of repetitions and low complexity regions that can be exploited at byte level by gzip, but clearly gzip does not read at bit level.bytes file ----------------- 11000 packed.str 9979 packed.str.gz 81000 unpacked.str 12777 unpacked.str.gz
Therefore my question is:
Given a set of bit-vectors like the ones above, what would be the best way to shrink down the list and if possible allow for random access to a given vector purely by the means of the line number (situation where i would like to extract a bit-vector at line x from a compressed format interests me).
I know i am asking a lot here but any suggestion is appreciated (even the negative one as long as it is cleverly phrased )? thnx
In reply to Bit-vector compression by baxy77bax
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |