I think you'll find that gzip does just as well with bits as with bytes. The LZW compression algorithm (used by GIF images, among others) is the only one I've ever studied in depth; it's not quite the same as gzip, but they have similar characteristics. LZW builds a table of symbols as it encounters new combinations of byte sequences, and in GIF the references to that table are encoded as sub-byte packed bits. If you took your bit string, converted it into a long string of ASCII '1' and '0' characters, and ran it through LZW, you would get basically the same result as from a bit-aware version of the same algorithm. I expect it would come out the same size as just encoding the bit vector as bytes (even though you started by expanding your input 8x). The GIF format can encode 2-color images, whose pixel patterns are essentially the same problem as finding patterns in a string of bits, yet the designers still chose to operate on the input at the byte level. That tells me they didn't find any benefit in processing the input as bits (or maybe it was just too slow?).
LZW would do really badly as a random-access compression scheme, and I think bzip2 and gzip would too. The reason is that the tables required to decode a block of data are probably larger than the bytes saved by keeping that block compressed, so if you need to keep the table available anyway, you might as well just decode the block once and keep the decoded data around instead.
However, almost all of the compression formats break their input into blocks, and the blocks are almost never larger than a few tens or hundreds of KB. So, depending on the length of your "lines", it might be practical to compress each individual line as a block, then write them out in an indexed format so that you can look up a line and decode it on demand.
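To make that concrete, here's a rough sketch in Perl of what I mean by an indexed format: gzip each line into one data file and record a fixed-width (offset, length) pair per line in a side index. The file names and the choice of IO::Compress::Gzip are just my assumptions; any of the IO::Compress modules would slot in the same way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build: compress each line into one data file, recording (offset, length)
# per line in a fixed-width index file (16 bytes per line).
sub build_index {
    my ($in_path, $dat_path, $idx_path) = @_;
    open my $in,  '<', $in_path  or die "$in_path: $!";
    open my $dat, '>', $dat_path or die "$dat_path: $!";
    open my $idx, '>', $idx_path or die "$idx_path: $!";
    binmode $_ for $dat, $idx;
    my $offset = 0;
    while (my $line = <$in>) {
        chomp $line;
        gzip \$line => \my $blob or die "gzip failed: $GzipError";
        print {$dat} $blob;
        print {$idx} pack('QQ', $offset, length $blob);  # needs a 64-bit Perl for 'Q'
        $offset += length $blob;
    }
}

# Lookup: seek straight to line N (0-based) and decompress only that block.
sub fetch_line {
    my ($dat_path, $idx_path, $n) = @_;
    open my $dat, '<', $dat_path or die "$dat_path: $!";
    open my $idx, '<', $idx_path or die "$idx_path: $!";
    binmode $_ for $dat, $idx;
    seek $idx, $n * 16, 0 or die "seek: $!";
    read($idx, my $rec, 16) == 16 or die "no such line $n";
    my ($offset, $len) = unpack 'QQ', $rec;
    seek $dat, $offset, 0 or die "seek: $!";
    read($dat, my $blob, $len) == $len or die "short read";
    gunzip \$blob => \my $line or die "gunzip failed: $GunzipError";
    return $line;
}
```

With 16 bytes of index per line, looking up a vector costs two seeks and one small gunzip, and nothing else ever has to be decompressed.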
Depending on the size of your lines, I would try one of these approaches:
A)
- Write each bit vector as its own file into a directory
- Create a squashfs image of that directory (see the sketch after this list)
- Mount the squashfs file
- Use plain, simple file access to read individual bit vectors; the kernel decompresses them transparently and very quickly
- Repeated access to bit vectors goes through Linux's page cache, even across different programs and subsequent runs
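Here's a rough sketch of approach A. The directory name, the mksquashfs options, and the mount point are placeholders of mine, and mounting normally needs root (or squashfuse for an unprivileged mount):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);

# One file per bit vector; the file name doubles as the lookup key.
my $dir = 'vectors.d';                                           # staging directory (placeholder)
make_path($dir);
my @vectors = (pack('B*', '10110010'), pack('B*', '01000111'));  # toy data
for my $i (0 .. $#vectors) {
    open my $fh, '>', "$dir/$i" or die "$dir/$i: $!";
    binmode $fh;
    print {$fh} $vectors[$i];
    close $fh;
}

# Pack the directory into a compressed, read-only squashfs image.
system('mksquashfs', $dir, 'vectors.squashfs', '-comp', 'gzip') == 0
    or die "mksquashfs failed";

# Mounting (as root, or via squashfuse) makes each vector readable as an
# ordinary file, decompressed transparently by the kernel:
#   mount -o loop,ro vectors.squashfs /mnt/vectors
#   # then: open '/mnt/vectors/42', read, done
```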
B)
- Use LMDB_File (or similar) to create a database of memory-mapped binary blobs, with each bit vector as its own blob.
- Write a Perl module that wraps it, compressing on write and decompressing on read using the IO::Compress::* and IO::Uncompress::* modules (see the sketch after this list)
- Try each module and see which gives you the level of compression vs. speed that works for you
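And a sketch of approach B, based on my reading of LMDB_File's object interface; the wrapper name, database name, and map size are my own placeholders, so double-check the calls against the module's documentation:

```perl
package BitVecStore;   # hypothetical wrapper module
use strict;
use warnings;
use LMDB_File qw(:flags);
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub new {
    my ($class, $dir) = @_;
    my $env = LMDB::Env->new($dir, { mapsize => 1 << 30, maxdbs => 1 });
    return bless { env => $env }, $class;
}

# Compress on write...
sub put {
    my ($self, $key, $bits) = @_;
    gzip \$bits => \my $blob or die "gzip: $GzipError";
    my $txn = $self->{env}->BeginTxn();
    my $db  = $txn->OpenDB({ dbname => 'vectors', flags => MDB_CREATE });
    $db->put($key, $blob);
    $txn->commit;
}

# ...and decompress on read.
sub get {
    my ($self, $key) = @_;
    my $txn = $self->{env}->BeginTxn();
    my $db  = $txn->OpenDB({ dbname => 'vectors', flags => MDB_CREATE });
    my $blob = $db->get($key);
    $txn->commit;
    return undef unless defined $blob;
    gunzip \$blob => \my $bits or die "gunzip: $GunzipError";
    return $bits;
}

1;
```

Swapping in IO::Compress::Bzip2 / IO::Uncompress::Bunzip2 (or another pair from the same family) is a two-line change to put/get, which makes the compression-vs-speed comparison easy to run.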