Hi MimisIVI
Your problem has been playing on my mind. You are actually talking about 2 problems. Data serialization and data compression.
The data you want to store is in perl structure terms:
{ word1 => { doc_id1 => [ pos1 pos2...posN ],
doc_id2 => [ pos1 pos2...posN ],
word2 => { doc_id1 => [ pos1 pos2...posN ],
doc_id2 => [ pos1 pos2...posN ],
Thinking about the problem reveals. We need a 64 bit doc ID because at 32 bits we are limited to 4294967296 doc_ids but we don't require all 64 bits. 48 bits gives us 281,474,976,710,656 ie 281 thousand billion doc_ids. More on this later. Our pos numbers represents the position in the doucument. An average document is 5k so the average pos is 2.5k. 2500 is 100111000100 in binary and the gamma encoding of this is 00000000000100111000100 (the number is 100111000100 or 12 bits and we need 11 leading zeros so we need 23 bits for our average case). It is worth noting that once we pass 65535 (16 bits) gamma endoded integers occupy 32 or more bits ie it is MORE expensive space wise to store them gamma encoded than as a simple 4 byte int.
If we were simply to use a 4 byte int we would need 32 bits so compression has an average case space saving of only 9 bits (assuming an average value of 2.5k) AND we have not done the serialisation required.
While this saving is real it is only a saving of 25% so at the very best this can only improve performance by 25%. This is probably not worth the trouble of implementing as if it does not work 25% slower it will not work 25% faster. Although every little bit does add up this is not a big win order of magnitude saving area.
If we said OK we will use 4 bytes we have 32 bits to play with. We need 1-2 bits for the serialisation to effectively take the places of the spaces and ; you are using when demonstration the data structure.
data => doc_id pos1 pos2...posN; doc_id2 pos1....; etc
If we "steal" the 2 most significant bits (MSB) from our 32 bits we have 1,073,741,824 available values for pos (30 bits) which allows us to index docs up to a GB long. If we steal one bit we have 2,147,483,648. We have to draw the line somewhere and I would put it to you you probably only want to index say the first 2.5k and the last 2.5k of big documents (on the basis this should get you capture the introduction and conclusion).
Anyway I suggest this is reasonable so your serialised data structure could use the MSB as a flag that says in effect the next 64 bits are a doc_id. Because we are choosing fixed width 4 byte ints every 4 bytes that follows is by definition a pos number unless it has the MSB set. The next doc_id is marked by the MSB flag.
Is this what you are looking for?
|