Actually, yes, this is what I want to do, but I don't have the pack/unpack knowledge to do it. If we set aside the compression methods I mentioned, can you give an example of how to implement this posting list structure?
About the positions that I save in my application:
For example, if I have this text:
"You can do what ever you want,if you dont think you cant"
The positions of the word "you" are 1,6,9,12
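A minimal sketch of how such positions could be computed, assuming a "word" is any run of word characters or apostrophes and that matching is case-insensitive (term_positions is only an illustrative name, not something from my application):

```perl
use strict;
use warnings;

# Return the 1-based word positions of $term in $text.
# Assumption: words are runs of \w characters or apostrophes,
# compared case-insensitively.
sub term_positions {
    my ($text, $term) = @_;
    my @positions;
    my $pos = 0;
    for my $word ($text =~ /[\w']+/g) {
        $pos++;
        push @positions, $pos if lc($word) eq lc($term);
    }
    return @positions;
}

my $text = "You can do what ever you want,if you dont think you cant";
print join(",", term_positions($text, "you")), "\n";   # prints 1,6,9,12
```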
I implemented the Gamma and the byte-aligned codes in Perl, and I have to say that Gamma gives very good compression results for the doc ids, because I don't save the real doc id for each document but the relative differences between consecutive doc ids.
For example:
Original Posting List only with doc ids:
Post: 1 4 5 7 ....
The differences:
Post: 1 3(4-1) 1(5-4) 2(7-5) .... (I keep the original id only for the first doc id)
Actually, I use the same difference method for the positions too.
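The difference step above can be sketched roughly like this, assuming the ids (or positions) are already sorted in ascending order; to_gaps and from_gaps are illustrative names:

```perl
use strict;
use warnings;

# Turn a sorted list of ids into gaps; the first value is kept as-is
# because the previous id starts at 0.
sub to_gaps {
    my @gaps;
    my $prev = 0;
    for my $id (@_) {
        push @gaps, $id - $prev;
        $prev = $id;
    }
    return @gaps;
}

# Rebuild the original ids by keeping a running sum of the gaps.
sub from_gaps {
    my @ids;
    my $sum = 0;
    for my $gap (@_) {
        $sum += $gap;
        push @ids, $sum;
    }
    return @ids;
}

print join(" ", to_gaps(1, 4, 5, 7)), "\n";     # prints 1 3 1 2
print join(" ", from_gaps(1, 3, 1, 2)), "\n";   # prints 1 4 5 7
```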
So for a term with a very high DF (which is my bottleneck) the differences are very small (2 on average), and with the Gamma code I use only 3 bits for each one! The problem is that the decoding is very complex (as BrowserUK said too), and I don't know how to include the positions for each doc id efficiently. I am using the substr function to read each bit from the bit string, and I don't know whether that is the most efficient way.
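One possible way to avoid reading the bit string one bit at a time is to let the regex engine skip the run of leading zeros and then take the remaining bits of the number in a single substr call. The sketch below assumes the standard Elias gamma layout (N-1 zero bits followed by the N-bit binary number, so 2 becomes 010); my own code may lay the bits out differently, and gamma_encode / gamma_decode are illustrative names:

```perl
use strict;
use warnings;

# Encode positive integers with Elias gamma and pack the bits into bytes.
# pack 'B*' pads the last byte with zero bits.
sub gamma_encode {
    my $bits = '';
    for my $n (@_) {
        die "gamma codes need n >= 1\n" if $n < 1;
        my $bin = sprintf '%b', $n;
        $bits .= ('0' x (length($bin) - 1)) . $bin;
    }
    return pack 'B*', $bits;
}

# Decode every gamma code in the packed string.
# /\G(0*)1/gc consumes the zero run plus the leading 1 bit in one match;
# the zero-only padding at the end never matches, which ends the loop.
sub gamma_decode {
    my ($packed) = @_;
    my $bits = unpack 'B*', $packed;
    my @nums;
    while ($bits =~ /\G(0*)1/gc) {
        my $len = length $1;
        push @nums, oct('0b1' . substr($bits, pos($bits), $len));
        pos($bits) += $len;
    }
    return @nums;
}

my $packed = gamma_encode(1, 3, 1, 2);
print length($packed), " byte(s)\n";              # prints 1 byte(s)
print join(" ", gamma_decode($packed)), "\n";     # prints 1 3 1 2
```

The four gaps from the example above fit in a single byte here (bits 1 011 1 010). Whether this is really faster than the per-bit substr loop would have to be measured, for example with the Benchmark module.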
About the byte-aligned code: as I wrote in another thread, it does not compress as well as Gamma, but the decoding is very simple.
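For comparison, Perl's built-in pack template 'w' (BER compressed integer) is itself a byte-aligned code: 7 data bits per byte, with the high bit set on every byte except the last. It is not necessarily the same byte layout as my own implementation, so take this only as a sketch of the idea:

```perl
use strict;
use warnings;

my @gaps = (1, 3, 1, 2, 300);

# Each value below 128 takes one byte; 300 needs two.
my $bytes = pack 'w*', @gaps;
print length($bytes), " bytes\n";                 # prints 6 bytes

# Decoding is a single unpack call, with no bit-level work in Perl code.
my @decoded = unpack 'w*', $bytes;
print join(" ", @decoded), "\n";                  # prints 1 3 1 2 300
```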
I was just trying to figure out whether the decoding time is worth it for these methods, but I really want to try your proposal, if you can help me.
About the spaces and the ; that I use in my structure:
I use them to separate the doc ids and the positions in the ASCII string, but if I use a binary representation of the structure, with flags to separate them, I think it is useless to keep them in the structure.
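To make the idea concrete, here is a rough sketch of a binary posting list that needs no separators: each document is stored as its doc id gap, then the number of positions, then the position gaps, all packed with the 'w' template. The layout, the hash-of-arrays input and the names encode_postings / decode_postings are assumptions for illustration only, not the structure that was proposed in this thread:

```perl
use strict;
use warnings;

# $postings is { doc_id => [ sorted positions ] }.
# Per document we store: doc id gap, position count, position gaps.
sub encode_postings {
    my ($postings) = @_;
    my @nums;
    my $prev_doc = 0;
    for my $doc (sort { $a <=> $b } keys %$postings) {
        my @pos = @{ $postings->{$doc} };
        push @nums, $doc - $prev_doc, scalar @pos;
        my $prev_pos = 0;
        for my $p (@pos) {
            push @nums, $p - $prev_pos;
            $prev_pos = $p;
        }
        $prev_doc = $doc;
    }
    return pack 'w*', @nums;
}

# The stored count tells us how many position gaps follow each doc id,
# so no space or ';' separators are needed.
sub decode_postings {
    my ($bytes) = @_;
    my @nums = unpack 'w*', $bytes;
    my %postings;
    my $doc = 0;
    while (@nums) {
        $doc += shift @nums;
        my $count = shift @nums;
        my $pos = 0;
        my @positions;
        for my $gap (splice @nums, 0, $count) {
            $pos += $gap;
            push @positions, $pos;
        }
        $postings{$doc} = \@positions;
    }
    return \%postings;
}

my $bytes = encode_postings({ 1 => [1, 6, 9, 12], 4 => [3], 7 => [2, 5] });
my $back  = decode_postings($bytes);
print length($bytes), " bytes\n";                  # prints 13 bytes
print join(",", @{ $back->{1} }), "\n";            # prints 1,6,9,12
```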