Hi,

Me again with a compression problem (I say again because i already had two posts on it but i am sooo far out of my league here that i will try to ask the monks for another brainstorming session here).

This time the issue is that i have a long list of bit-vectors that are taking up too much space. When i look at them at byte scale i see not redundancy but when i inspect the bit level there is roughly space for ~8x compression to be exploited. To illustrate the problem i've written the model down :

use strict; srand(2); # create a random string of 1's and 0's (string length = $l) sub make_rand_str{ my ($l) = @_; if ($l%8!=0){ print STDERR "not a divider of a byte 8-bits)\n"; die; } my $res = ''; for (1..$l){ $res .= int(rand(2)); } return $res; } # take a string and randomly flip 1's/0's for <= $n number of positio +ns sub flip { my ($n,$str) = @_; my $res = ''; my @str = split("",$str); for (0..($n-1)){ my $i = int(rand(length($str)-1)); $str[$i] = ($str[$i] eq '1' ? '0': '1' ); } return join("", @str); } # pack the string sub pack_str { my ($s) = @_; my $res = ''; my @str = split("",$s); for(my $i=0; $i<@str; $i+=8){ my $st = join("",@str[$i..($i+7)]); $res .= pack('B8',$st); } return $res; } #------------------------Main---------------------# # though the string length here is set to 80 in reality it is anywhere + between 300-1000 my $str_len = 80; my $num_of_flips = int($str_len * 0.25); my $randstr = &make_rand_str($str_len); print $randstr . "\n"; open(UNPACKED,">", "unpacked.str") || die $!; open(PACKED,">:raw", "packed.str") || die $!; for (1..1000){ my $mut_str = &flip($num_of_flips,$randstr); print UNPACKED "$mut_str\n"; my $packed_str = &pack_str($mut_str); print PACKED "$packed_str"; } close UNPACKED; close PACKED; system("cat unpacked.str | gzip -9 -c - > unpacked.str.gz"); system("cat packed.str | gzip -9 -c - > packed.str.gz");
when i try to compress the generated (bit-)string dictionary i get :
bytes file ----------------- 11000 packed.str 9979 packed.str.gz 81000 unpacked.str 12777 unpacked.str.gz
this demonstrates the existence of repetitions and low complexity regions that can be exploited at byte level by gzip, but clearly gzip does not read at bit level.

Therefore my question is:

Given a set of bit-vectors like the ones above, what would be the best way to shrink down the list and if possible allow for random access to a given vector purely by the means of the line number (situation where i would like to extract a bit-vector at line x from a compressed format interests me).

I know i am asking a lot here but any suggestion is appreciated (even the negative one as long as it is cleverly phrased )? thnx


In reply to Bit-vector compression by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.