baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:
Me again with a compression problem (I say again because i already had two posts on it but i am sooo far out of my league here that i will try to ask the monks for another brainstorming session here).
This time the issue is that i have a long list of bit-vectors that are taking up too much space. When i look at them at byte scale i see not redundancy but when i inspect the bit level there is roughly space for ~8x compression to be exploited. To illustrate the problem i've written the model down :
when i try to compress the generated (bit-)string dictionary i get :use strict; srand(2); # create a random string of 1's and 0's (string length = $l) sub make_rand_str{ my ($l) = @_; if ($l%8!=0){ print STDERR "not a divider of a byte 8-bits)\n"; die; } my $res = ''; for (1..$l){ $res .= int(rand(2)); } return $res; } # take a string and randomly flip 1's/0's for <= $n number of positio +ns sub flip { my ($n,$str) = @_; my $res = ''; my @str = split("",$str); for (0..($n-1)){ my $i = int(rand(length($str)-1)); $str[$i] = ($str[$i] eq '1' ? '0': '1' ); } return join("", @str); } # pack the string sub pack_str { my ($s) = @_; my $res = ''; my @str = split("",$s); for(my $i=0; $i<@str; $i+=8){ my $st = join("",@str[$i..($i+7)]); $res .= pack('B8',$st); } return $res; } #------------------------Main---------------------# # though the string length here is set to 80 in reality it is anywhere + between 300-1000 my $str_len = 80; my $num_of_flips = int($str_len * 0.25); my $randstr = &make_rand_str($str_len); print $randstr . "\n"; open(UNPACKED,">", "unpacked.str") || die $!; open(PACKED,">:raw", "packed.str") || die $!; for (1..1000){ my $mut_str = &flip($num_of_flips,$randstr); print UNPACKED "$mut_str\n"; my $packed_str = &pack_str($mut_str); print PACKED "$packed_str"; } close UNPACKED; close PACKED; system("cat unpacked.str | gzip -9 -c - > unpacked.str.gz"); system("cat packed.str | gzip -9 -c - > packed.str.gz");
this demonstrates the existence of repetitions and low complexity regions that can be exploited at byte level by gzip, but clearly gzip does not read at bit level.bytes file ----------------- 11000 packed.str 9979 packed.str.gz 81000 unpacked.str 12777 unpacked.str.gz
Therefore my question is:
Given a set of bit-vectors like the ones above, what would be the best way to shrink down the list and if possible allow for random access to a given vector purely by the means of the line number (situation where i would like to extract a bit-vector at line x from a compressed format interests me).
I know i am asking a lot here but any suggestion is appreciated (even the negative one as long as it is cleverly phrased )? thnx
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Bit-vector compression
by Fletch (Bishop) on Mar 01, 2022 at 23:46 UTC | |
by eyepopslikeamosquito (Archbishop) on Mar 02, 2022 at 00:29 UTC | |
|
Re: Bit-vector compression
by jwkrahn (Abbot) on Mar 01, 2022 at 23:32 UTC | |
|
Re: Bit-vector compression
by NERDVANA (Priest) on Mar 02, 2022 at 09:48 UTC | |
by LanX (Saint) on Mar 02, 2022 at 11:54 UTC | |
by baxy77bax (Deacon) on Mar 03, 2022 at 00:37 UTC | |
by LanX (Saint) on Mar 03, 2022 at 01:05 UTC | |
|
Re: Bit-vector compression
by salva (Canon) on Mar 02, 2022 at 12:16 UTC | |
by baxy77bax (Deacon) on Mar 03, 2022 at 00:40 UTC |