in reply to Re^3: Compress positive integers
in thread Compress positive integers
The problem with the Elias gamma encoding is that, even if you implement the bit-twiddling in C, the fields span byte boundaries, so compressing and decompressing your posting lists is slower than unpacking your ASCII representation using pack/unpack.
So, whilst you might achieve a few percent more compression than with the simple binary packing I showed elsewhere, the time saved in reading and transmitting that slightly smaller volume of data is negated by the time it takes to decompress it.
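For illustration, here is a minimal sketch of the kind of bit-twiddling involved; the put_bit helper, the zero-initialised buffer, and the names are my own assumptions rather than anything from the paper, but it shows why every value forces work at the bit level instead of the byte level:

    #include <stdint.h>

    /* Append one bit (MSB first) to a zero-initialised buffer. */
    static void put_bit( uint8_t *buf, int *bitPos, int bit ) {
        if( bit ) buf[*bitPos >> 3] |= (uint8_t)( 1u << ( 7 - ( *bitPos & 7 ) ) );
        (*bitPos)++;
    }

    /* Elias gamma: floor(log2 x) zero bits, then x in binary, MSB first. */
    void elias_gamma( uint8_t *buf, int *bitPos, uint32_t x ) {   /* requires x >= 1 */
        int nbits = 0;
        for( uint32_t t = x; t > 1; t >>= 1 ) nbits++;    /* nbits = floor(log2 x)        */
        for( int i = 0; i < nbits; i++ )
            put_bit( buf, bitPos, 0 );                    /* unary length prefix           */
        for( int i = nbits; i >= 0; i-- )
            put_bit( buf, bitPos, ( x >> i ) & 1 );       /* binary value, incl. leading 1 */
    }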
The algorithm you posted in your OP, which is taken from the Buettcher paper, seeks to address that problem by packing only to whole byte boundaries. The problem with it is that the code shown in the paper for compression and decompression is uselessly incomplete. Eg.
Compression code given:
    int outPos = 0, previous = 0;
    for (int inPos = 0; inPos < n; inPos++) {
        int delta = uncompressed[inPos] - previous;
        while (delta >= 128) {
            compressed[outPos++] = (delta & 127) | 128;
            delta = delta >> 7;
        }
        compressed[outPos++] = delta;
    }
Note: n in the for loop is never set. This can be derived from the length of the input. However, notice that previous is initialised to 0, and subtracted from each input number, but it is never set to any value other than zero!
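For what it's worth, a minimal corrected sketch of that compressor might look like the following; the function name, the unsigned char output buffer, and returning the byte count are my own assumptions, not anything given in the paper:

    /* Sketch only: delta + variable-byte compression with the missing
       'previous' update added. Assumes uncompressed[] holds n sorted
       positive integers and compressed[] is large enough (worst case
       5 bytes per value). Returns the number of bytes written, which
       the caller needs to record, since it cannot be recovered later. */
    int compress_vbyte( int *uncompressed, int n, unsigned char *compressed ) {
        int outPos = 0, previous = 0;
        for( int inPos = 0; inPos < n; inPos++ ) {
            int delta = uncompressed[inPos] - previous;      /* encode the gap        */
            previous  = uncompressed[inPos];                 /* the missing statement */
            while( delta >= 128 ) {
                compressed[outPos++] = (delta & 127) | 128;  /* 7 bits + continuation */
                delta >>= 7;
            }
            compressed[outPos++] = delta;                    /* last byte, high bit clear */
        }
        return outPos;
    }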
Decompression code given:
    int outPos = 0, previous = 0;
    for (int outPos = 0; outPos < n; outPos++) {
        for (int shift = 0; ; shift += 7) {
            int temp = compressed[inPos++];
            previous += ((temp & 127) << shift);
            if (temp < 128) break;
        }
        uncompressed[outPos] = previous;
    }
Note: Again, n is never set. But this time, you cannot derive it from the input. The input is a byte stream--ie. a string of bytes--but you cannot use strlen, as the string will contain embedded nulls.
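A corrected sketch of the matching decompressor, assuming the caller stores the integer count n alongside the compressed bytes (since, as noted, it cannot be recovered from the data itself); again, the name and signature are my own assumptions:

    /* Sketch only: decodes n delta/variable-byte encoded values.
       Returns the number of compressed bytes consumed. */
    int decompress_vbyte( unsigned char *compressed, int n, int *uncompressed ) {
        int inPos = 0, previous = 0;
        for( int outPos = 0; outPos < n; outPos++ ) {
            for( int shift = 0; ; shift += 7 ) {
                int temp = compressed[inPos++];
                previous += (temp & 127) << shift;   /* accumulate the gap onto the running value */
                if( temp < 128 ) break;              /* high bit clear marks the final byte       */
            }
            uncompressed[outPos] = previous;         /* running sum restores the absolute offset  */
        }
        return inPos;
    }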
So, one algorithm costs more than it saves, and the other, which seeks to address that, is incomplete as given and would probably still cost too much if implemented in Perl.
The major costs of your schema are:
Each time, you have to unpack/decompress the document offsets for every document that contains a particular word, just to gain access to those that match your complete query.
Using your numbers: 5,000 docs * 554 words/doc / 15,000 unique words = each word appears in (on average) 185 documents.
That means you read, transfer and unpack/decompress all the offsets for 185 documents for each word. But probabilistically, for any given 2-word query, you only need the data for about 7 documents: each word matches 185/5,000 = 3.7% of the documents, so both words match roughly 0.037 * 0.037 * 5,000 ≈ 7 of them.
So you've had to handle the data for 370 documents, but then discard it for 356 of them.
If you went with the schema I presented, you only read, transfer and unpack/decompress the data for those 7 documents that contain both words. That's 14/370 * 100 ≈ 3.8%, ie. less than 4% of the data to be manipulated.
That saving will far exceed any additional compression you might get from using any of the methods you've mooted, or any other compression mechanism.
I don't think that breaking the schema down beyond the word-doc-offsets (plural) I suggested, to word-doc-offset (singular), will help much. You still need to transfer all the offsets for each word-doc pairing anyway, and the explosion in the size of the indexes will probably slow things down.
I think tachyon's point about variable-length fields is a good one, except that it only affects the performance of indexes if the variable-length data is part of the index. In the schema I outlined, that is not the case. But I'm not a DBA.