I think you'll find that gzip does just as well with bits as with bytes. The LZW compression algorithm (used by GIF images, among others) is the only one I've ever studied in depth; it's not quite the same as gzip, but they have similar characteristics. LZW builds a table of symbols as it encounters new combinations of byte sequences, and in GIF the references to that table are encoded as sub-byte packed bits. If you took your bit string, converted it into a long string of ASCII '1' and '0' characters, and ran it through LZW, you would get basically the same result as from a bit-aware version of the same algorithm. I expect it would come out the same size as just encoding the bit vector as bytes (even though you started by expanding your input 8x). The GIF format can encode 2-color images, whose pixel patterns are essentially the same problem as finding patterns in a string of bits, yet the designers still chose to operate on the input at the byte level. That tells me they didn't find any benefit in processing the input as bits (or maybe it was just too slow?).
LZW would do really badly as a random-access compression scheme, and I think bzip2 and gzip would too. The reason is that the tables required to decode a block of data are probably larger than the bytes saved by keeping that block compressed, so if you need to keep the table available anyway, you might as well just decode the block once and keep the decoded data around instead.
However, almost all of the compression formats break their input into blocks, and the blocks are almost never larger than a few tens or hundreds of KB. So, depending on the length of your "lines", it might be practical to compress each individual line as a block, then write them out in an indexed format so that you can look up a line and decode it on demand.
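To make that concrete, here's a rough sketch in Perl of what I mean by an indexed format: gzip each line into one data file and record a fixed-width (offset, length) pair per line in a side index. The file names and the choice of IO::Compress::Gzip are just my assumptions; any of the IO::Compress modules would slot in the same way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build: compress each line into one data file, recording (offset, length)
# per line in a fixed-width index file (16 bytes per line).
sub build_index {
    my ($in_path, $dat_path, $idx_path) = @_;
    open my $in,  '<', $in_path  or die "$in_path: $!";
    open my $dat, '>', $dat_path or die "$dat_path: $!";
    open my $idx, '>', $idx_path or die "$idx_path: $!";
    binmode $_ for $dat, $idx;
    my $offset = 0;
    while (my $line = <$in>) {
        chomp $line;
        gzip \$line => \my $blob or die "gzip failed: $GzipError";
        print {$dat} $blob;
        print {$idx} pack('QQ', $offset, length $blob);  # needs a 64-bit Perl for 'Q'
        $offset += length $blob;
    }
}

# Lookup: seek straight to line N (0-based) and decompress only that block.
sub fetch_line {
    my ($dat_path, $idx_path, $n) = @_;
    open my $dat, '<', $dat_path or die "$dat_path: $!";
    open my $idx, '<', $idx_path or die "$idx_path: $!";
    binmode $_ for $dat, $idx;
    seek $idx, $n * 16, 0 or die "seek: $!";
    read($idx, my $rec, 16) == 16 or die "no such line $n";
    my ($offset, $len) = unpack 'QQ', $rec;
    seek $dat, $offset, 0 or die "seek: $!";
    read($dat, my $blob, $len) == $len or die "short read";
    gunzip \$blob => \my $line or die "gunzip failed: $GunzipError";
    return $line;
}
```

With 16 bytes of index per line, looking up a vector costs two seeks and one small gunzip, and nothing else ever has to be decompressed.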
Depending on the size of your lines, I would try one of these approaches:
A)
- Write each bit vector as its own file into a directory
- Create a squashfs image of that directory (see the sketch after this list)
- Mount the squashfs file
- Use plain, simple file access to read individual bit vectors; the kernel decompresses them transparently and very quickly
- Repeated access to bit vectors goes through Linux's page cache, even across different programs and subsequent runs
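Here's a rough sketch of approach A. The directory name, the mksquashfs options, and the mount point are placeholders of mine, and mounting normally needs root (or squashfuse for an unprivileged mount):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);

# One file per bit vector; the file name doubles as the lookup key.
my $dir = 'vectors.d';                                           # staging directory (placeholder)
make_path($dir);
my @vectors = (pack('B*', '10110010'), pack('B*', '01000111'));  # toy data
for my $i (0 .. $#vectors) {
    open my $fh, '>', "$dir/$i" or die "$dir/$i: $!";
    binmode $fh;
    print {$fh} $vectors[$i];
    close $fh;
}

# Pack the directory into a compressed, read-only squashfs image.
system('mksquashfs', $dir, 'vectors.squashfs', '-comp', 'gzip') == 0
    or die "mksquashfs failed";

# Mounting (as root, or via squashfuse) makes each vector readable as an
# ordinary file, decompressed transparently by the kernel:
#   mount -o loop,ro vectors.squashfs /mnt/vectors
#   # then: open '/mnt/vectors/42', read, done
```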
B)
- Use LMDB_File (or similar) to create a database of memory-mapped binary blobs, with each bit vector as its own blob.
- Write a Perl module that wraps it, compressing on write and decompressing on read using the IO::Compress::* and IO::Uncompress::* modules (see the sketch after this list)
- Try each module and see which gives you the level of compression vs. speed that works for you
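And a sketch of approach B, based on my reading of LMDB_File's object interface; the wrapper name, database name, and map size are my own placeholders, so double-check the calls against the module's documentation:

```perl
package BitVecStore;   # hypothetical wrapper module
use strict;
use warnings;
use LMDB_File qw(:flags);
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub new {
    my ($class, $dir) = @_;
    my $env = LMDB::Env->new($dir, { mapsize => 1 << 30, maxdbs => 1 });
    return bless { env => $env }, $class;
}

# Compress on write...
sub put {
    my ($self, $key, $bits) = @_;
    gzip \$bits => \my $blob or die "gzip: $GzipError";
    my $txn = $self->{env}->BeginTxn();
    my $db  = $txn->OpenDB({ dbname => 'vectors', flags => MDB_CREATE });
    $db->put($key, $blob);
    $txn->commit;
}

# ...and decompress on read.
sub get {
    my ($self, $key) = @_;
    my $txn = $self->{env}->BeginTxn();
    my $db  = $txn->OpenDB({ dbname => 'vectors', flags => MDB_CREATE });
    my $blob = $db->get($key);
    $txn->commit;
    return undef unless defined $blob;
    gunzip \$blob => \my $bits or die "gunzip: $GunzipError";
    return $bits;
}

1;
```

Swapping in IO::Compress::Bzip2 / IO::Uncompress::Bunzip2 (or another pair from the same family) is a two-line change to put/get, which makes the compression-vs-speed comparison easy to run.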