in reply to Bit-vector compression

I think you'll find that gzip does just as well with bits as with bytes. LZW compression algorithm (GIF images, and WinZip) is the only one I've ever studied in depth, which is not quite the same as gzip, but they have similar characteristics. In LZW, it builds a table of symbols as it encounters new combinations of byte sequences. In GIF, the references to that table are encoded in sub-byte packed bits. If you take your bit string, and convert it into a long string of ASCII '1' and '0', running it through LZW would give you basically the same results as a bit-aware version of the same algorithm. I expect it will come out to the same size as just encoding the bit vector as bytes (even though you started by expanding your input by 8x). The GIF implementation can be used to encode 2-color images, whose pixel patterns are essentially the same as wanting to find patterns in a string of bits, but the designers still choose to operate on the input at the byte level. This tells me that they didn't find any benefit of processing the input as bits (or maybe it was just too slow?)

LZW would do really badly as a random-access compression system. I think bzip and gzip would also be bad for that. This is because the tables required to encode a block of data are probably larger than the bytes saved by keeping the data compressed, so if you need to have the table available, you might as well just decode it and have the decoded data available.

However, almost all of the compression formats break their input into blocks, and the blocks are almost never larger than a few tens or hundreds of KB. So, depending on the length of your "lines", it might be practical to compress each individual line as a block, then write them out in an indexed format so that you can look up a line and decode it on demand.

Depending on the size of your lines, I would try one of these approaches:

A)

B)

Replies are listed 'Best First'.
Re^2: Bit-vector compression
by LanX (Saint) on Mar 02, 2022 at 11:54 UTC
      and yet each time a different problem :) that why i stated that i already had a two similar posts ;) on the other hand bit-strings/bit-vectors (terms sometimes uses interchangeably, though i do not like that) are quite different. there is very little room to maneuver as the description of a string is what is being the focus of compression.

      for future reference :

      it seams http://bitmagic.io/ has some interesting solutions.

      thnx!

        > there is very little room to maneuver as the description of a string is what is being the focus of compression.

        sorry?

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery