baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Me again with a compression problem (I say "again" because I already had two posts on it, but I am so far out of my league here that I will try to ask the monks for another brainstorming session).

This time the issue is that I have a long list of bit-vectors that are taking up too much space. When I look at them at the byte level I see no redundancy, but when I inspect the bit level there is room for roughly 8x compression to be exploited (each '0'/'1' character occupies a full byte). To illustrate the problem I've written a model:

use strict;
srand(2);

# create a random string of 1's and 0's (string length = $l)
sub make_rand_str {
    my ($l) = @_;
    if ($l % 8 != 0) {
        print STDERR "not a divider of a byte (8 bits)\n";
        die;
    }
    my $res = '';
    for (1..$l) {
        $res .= int(rand(2));
    }
    return $res;
}

# take a string and randomly flip 1's/0's for <= $n number of positions
sub flip {
    my ($n, $str) = @_;
    my @str = split("", $str);
    for (0..($n-1)) {
        my $i = int(rand(length($str)-1));
        $str[$i] = ($str[$i] eq '1' ? '0' : '1');
    }
    return join("", @str);
}

# pack the string
sub pack_str {
    my ($s) = @_;
    my $res = '';
    my @str = split("", $s);
    for (my $i = 0; $i < @str; $i += 8) {
        my $st = join("", @str[$i..($i+7)]);
        $res .= pack('B8', $st);
    }
    return $res;
}

#------------------------Main---------------------#
# though the string length here is set to 80, in reality it is anywhere between 300-1000
my $str_len      = 80;
my $num_of_flips = int($str_len * 0.25);
my $randstr      = &make_rand_str($str_len);
print $randstr . "\n";

open(UNPACKED, ">",     "unpacked.str") || die $!;
open(PACKED,   ">:raw", "packed.str")   || die $!;
for (1..1000) {
    my $mut_str = &flip($num_of_flips, $randstr);
    print UNPACKED "$mut_str\n";
    my $packed_str = &pack_str($mut_str);
    print PACKED "$packed_str";
}
close UNPACKED;
close PACKED;

system("gzip -9 -c packed.str   > packed.str.gz");
system("gzip -9 -c unpacked.str > unpacked.str.gz");
when i try to compress the generated (bit-)string dictionary i get :
bytes  file
-----------------
11000  packed.str
 9979  packed.str.gz
81000  unpacked.str
12777  unpacked.str.gz
This demonstrates that there are repetitions and low-complexity regions gzip can exploit at the byte level (the 81000-byte unpacked file shrinks to 12777 bytes), but clearly gzip does not read at the bit level: the 11000-byte packed file barely compresses at all.

Therefore my question is:

Given a set of bit-vectors like the ones above, what would be the best way to shrink the list down while, if possible, allowing random access to a given vector purely by means of its line number? (The situation where I would like to extract the bit-vector at line x directly from a compressed format is what interests me.)

I know I am asking a lot here, but any suggestion is appreciated (even a negative one, as long as it is cleverly phrased). thnx

Replies are listed 'Best First'.
Re: Bit-vector compression
by Fletch (Bishop) on Mar 01, 2022 at 23:46 UTC

    The question is vague, so just some random thoughts, but . . .

    • It's possible to use (say) Compress::Bzip2 to compress and uncompress strings in memory. You could certainly store a file of compressed data "lines" and read from that but the problem becomes knowing what's a "line" to read and then you're starting to need to implement your own file format with an index and . . . bleh. But not too bad (maybe inspired by the zip format have a TOC/index block at the end of the file which you read which provides a mapping of file offset (that you could pass to seek) and the number of bytes for that particular "line").
    • Problem with that approach though is that by splitting things into separate lines compressed separately you're going to "lose" some of the redundancy that gzip was able to bite on.
    • You haven't said why you need all these bits but something made me think of Bloom Filters; that may or may not be anywhere relevant. If you go a bit more into what you're trying to do with all of these you might get some more useful prior art pointers.
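
    Fletch's TOC/index idea above could be sketched roughly like this, using Compress::Zlib's in-memory compress/uncompress (Compress::Bzip2 has a comparable memBzip/memBunzip interface). The file layout — blobs followed by fixed-size offset/length records and an 8-byte trailer — is my own ad-hoc convention, not an established format:

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Hypothetical layout: [blob0][blob1]...[index][trailer]
# index   = one pack('NN', offset, length) record per line
# trailer = pack('NN', index_offset, line_count)
sub write_indexed {
    my ($file, @lines) = @_;
    open my $fh, '>:raw', $file or die $!;
    my (@idx, $pos);
    $pos = 0;
    for my $line (@lines) {
        my $blob = compress($line);
        print {$fh} $blob;
        push @idx, [ $pos, length $blob ];
        $pos += length $blob;
    }
    print {$fh} pack('NN', @$_) for @idx;
    print {$fh} pack('NN', $pos, scalar @idx);
    close $fh;
}

# Random access by line number: read the trailer, seek to the
# n-th index record, then seek to that blob and uncompress it.
sub read_line_n {
    my ($file, $n) = @_;
    open my $fh, '<:raw', $file or die $!;
    seek $fh, -8, 2;              # 8-byte trailer at EOF
    read $fh, my $trailer, 8;
    my ($idx_off, $count) = unpack 'NN', $trailer;
    die "no such line" if $n >= $count;
    seek $fh, $idx_off + 8 * $n, 0;
    read $fh, my $rec, 8;
    my ($off, $len) = unpack 'NN', $rec;
    seek $fh, $off, 0;
    read $fh, my $blob, $len;
    return uncompress($blob);
}
```

    Fletch's second bullet still applies, though: compressing each short line on its own carries per-stream overhead, so a ~40-125 byte vector may not shrink at all.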

    Update: Actually, after thinking it over, you might just be able to use Archive::Zip and let it store each of your "lines" as a separate file in the archive. That'd probably get you most of the way there without needing to reimplement any wheels. If you're truly worried about runtime space, implement some sort of LRU cache on top of that and only keep the last n actively accessed "lines" live in memory.
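
    A rough sketch of that Archive::Zip variant (the member-name convention "line-N" is my own assumption):

```perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

# Store each bit-vector as its own compressed member of a zip archive.
my @vectors = ('01100101', '01100001', '11100101');
my $zip = Archive::Zip->new();
for my $i (0 .. $#vectors) {
    $zip->addString($vectors[$i], "line-$i");   # member name is arbitrary
}
$zip->writeToFileNamed('vectors.zip') == AZ_OK or die "write error";

# Random access by line number: re-open the archive, pull one member.
my $zip2 = Archive::Zip->new();
$zip2->read('vectors.zip') == AZ_OK or die "read error";
my $bits = $zip2->contents('line-1');
print "$bits\n";
```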

    Another thought depending on the sparseness of your bits using Set::IntSpan and a textual representation could have possibilities. But still . . . much handwavium here.

    The cake is a lie.

      I'd add Judy arrays to your excellent list of fun/desperate things to try. :)

Re: Bit-vector compression
by jwkrahn (Abbot) on Mar 01, 2022 at 23:32 UTC
    my $i = int(rand(length($str)-1));

    You have an off-by-one problem there.

    Say that $str has a length of 8 and so has characters at positions 0 through 7.

    int(rand(length($str)-1)) will produce random integers in the range 0 through 6, so the last position can never be flipped. You want int(rand(length $str)) to cover positions 0 through 7.

    Update:

    This produces the same result as your pack_str subroutine:

    sub pack_str {
        my ( $s ) = @_;
        return pack 'B*', $s;
    }
Re: Bit-vector compression
by NERDVANA (Priest) on Mar 02, 2022 at 09:48 UTC
    I think you'll find that gzip does just as well with bits as with bytes. The LZW compression algorithm (GIF images, early zip tools) is the only one I've ever studied in depth, which is not quite the same as gzip, but they have similar characteristics. LZW builds a table of symbols as it encounters new combinations of byte sequences. In GIF, the references to that table are encoded in sub-byte packed bits. If you take your bit string and convert it into a long string of ASCII '1' and '0', running it through LZW would give you basically the same results as a bit-aware version of the same algorithm. I expect it will come out to the same size as just encoding the bit vector as bytes (even though you started by expanding your input by 8x). The GIF implementation can be used to encode 2-color images, whose pixel patterns are essentially the same as wanting to find patterns in a string of bits, but the designers still chose to operate on the input at the byte level. This tells me that they didn't find any benefit in processing the input as bits (or maybe it was just too slow?).

    LZW would do really badly as a random-access compression system. I think bzip and gzip would also be bad for that. This is because the tables required to encode a block of data are probably larger than the bytes saved by keeping the data compressed, so if you need to have the table available, you might as well just decode it and have the decoded data available.

    However, almost all of the compression formats break their input into blocks, and the blocks are almost never larger than a few tens or hundreds of KB. So, depending on the length of your "lines", it might be practical to compress each individual line as a block, then write them out in an indexed format so that you can look up a line and decode it on demand.

    Depending on the size of your lines, I would try one of these approaches:

    A)

    • Write each bit vector as its own file into a directory
    • Create a squashfs image of that directory
    • Mount the squashfs file
    • Use nice simple file access to read individual bit vectors, which will automatically decompress in the kernel at extremely high speeds
    • Repeat access to bit vectors uses Linux's disk cache, even across different programs and subsequent runs

    B)

    • Use LMDB_File (or similar) to create a database of memory-mapped binary blobs, with each bit vector as its own blob.
    • Write a perl module that wraps it and compresses writes and decompresses reads, using IO::Compress::* and IO::Uncompress::* modules.
    • Try each module and see which gives you the level of compression vs. speed that works for you
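
    The compress-on-write / decompress-on-read wrapper in option B could look something like this, shown with IO::Compress::Deflate and a plain hash standing in for the tied LMDB_File database, to keep the sketch self-contained:

```perl
use strict;
use warnings;
use IO::Compress::Deflate   qw(deflate $DeflateError);
use IO::Uncompress::Inflate qw(inflate $InflateError);

my %db;   # stand-in for a tied LMDB_File hash

# Compress on write: pack 8 ASCII bits into each byte first, then deflate.
# (Assumes the bit string's length is a multiple of 8, as in the OP's data.)
sub store_vec {
    my ($key, $bits) = @_;
    my $packed = pack 'B*', $bits;
    deflate \$packed => \my $blob or die "deflate failed: $DeflateError";
    $db{$key} = $blob;
}

# Decompress on read: inflate the blob, then unpack back to '0'/'1' chars.
sub fetch_vec {
    my ($key) = @_;
    inflate \$db{$key} => \my $packed or die "inflate failed: $InflateError";
    return unpack 'B*', $packed;
}
```

    Swapping in IO::Compress::Bzip2 or IO::Compress::Zstd here is a one-line change, which makes the "try each module" comparison cheap.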
        and yet each time a different problem :) that's why I stated that I already had two similar posts ;) on the other hand, bit-strings/bit-vectors (terms sometimes used interchangeably, though I do not like that) are quite different: there is very little room to maneuver, as the description of the string itself is what is the focus of the compression.

        for future reference :

        it seems http://bitmagic.io/ has some interesting solutions.

        thnx!

Re: Bit-vector compression
by salva (Canon) on Mar 02, 2022 at 12:16 UTC
    Even if you are randomly flipping bits, the bit-strings are all aligned at a byte boundary, so that is actually equivalent to just randomly changing bytes in a byte-string. If that is your use case, then yes, just packing the strings and using a regular compressor would perform quite well.

    On the other hand, if in your real use case bit-strings are not aligned in any way, then, compressing them as bytes is going to perform somewhat worse. Something that may work better is to use Run Length Encoding (RLE) to transform the bit-string into a byte-string that can then be compressed using regular compression algorithms.
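
    A minimal sketch of that RLE transform (assuming, for simplicity, that no run is longer than 255 bits; a real encoder would need an escape code for longer runs):

```perl
use strict;
use warnings;

# Encode: first byte records the starting bit (0 or 1),
# then one byte per run length. Runs alternate 0/1 by definition.
sub rle_encode {
    my ($bits) = @_;
    my @runs;
    push @runs, length($1) while $bits =~ /\G(0+|1+)/g;
    die "run > 255 bits, needs an escape code" if grep { $_ > 255 } @runs;
    return pack 'C*', substr($bits, 0, 1), @runs;
}

# Decode: replay the run lengths, flipping the bit after each run.
sub rle_decode {
    my ($bytes) = @_;
    my ($bit, @runs) = unpack 'C*', $bytes;
    my $out = '';
    for my $len (@runs) {
        $out .= $bit x $len;
        $bit  = 1 - $bit;
    }
    return $out;
}
```

    The byte-string that rle_encode produces is what would then be fed to a regular compressor, as salva describes.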

    In any case, the final result would vary greatly depending on the specifics of the bit-strings you are trying to compress and the actual number of flips. The 25% parameter you are using ensures that any bit-sequence longer than a few bits is mutated.

      well, now that you "spell it out" to me like that, I see the point.

      Thank you :)