Re^7: Compress positive integers

Replies are listed 'Best First'.
Re^8: Compress positive integers by tachyon-II (Chaplain) on Apr 09, 2008 at 17:38 UTC
I am just a hacker. Most of what I know I learnt from trying this, testing that, trying something else. The rough schema suggested will sort of work at scale. It should be considerably faster than how you are trying to do it currently but still has issues. The storage requirement is OK. The BIG table will take 4 bytes x 3 per record (3 ints). In rough terms your table will be around 240% the size of you total document size assuming 4 bytes per word average plus one byte for the space and pure ASCII documents. It will be less than this if the documents are .doc, .pdf, .htm etc as the textual content is less and that is what you use to index. Believe it or not (given your compression focus) it might be better to put in one "useless" INT to pad each record out to 16 bytes. This is becuase 16 bytes is a neat fit into 256/512/1024.... size disk sectors. Even if you don't pad your RDBMS may well add this padding for you - depends on the RDBMS. Will that make a difference? Dunno? Don't take my word for it. Make two big tables, with 12 and 16 byte records and benchmark it. It did used to make a difference. It may only speed the creation of the index (but for a big table this can take a long time). Given that everything rotates around access to this table you need to get it optimised as best you can first. Anyway, the table will be big BUT .... it that does not really matter. The reason is indexing. The simplest way to understand an index (and the whole O n O log n thing) is to recall the guess a number game: Guess a number between 0 and 128. You guess 64 (0+128)/2. Higher. You guess 96 (64+128)/2. Higher. You guess 112 (96+128)/2. Higher. You guess 120. Higher. You guess 124. Higher. You guess 126. Higher. You guess 127. Higher. 128!!!!. As opposed to getting the answer in (on average 64 tries) we got it in 7. Why 7. int(log2 128) = 7. This is actually the worst case, but the average case is 5-6 guesses so it is somewhat academic. If you just started at 0 and went up, or started at 128 and went down, on average we would take 64 guesses to get the right answer. So iteration is 0.5 O n, the binary search algoritm is O log(2) n. Why is this good? Well if we go to guess a number between 1 and 1,000,000 the iterative solution takes 500,000 tries on average (1,000,000 worst case) but the binary search takes only 19 tries (worst case). Get it? O n is bad at scale. O log n basically does not care about scale. Binary searches are like indexes. In essence indexes make O n operations O log n. Indexes cost space and time to calculate but once you have them calculated you turn a 500,000 time unit operation into a < 20 time unit operation! O log n scales. O n scales poorly. O n^2 dies at scale. To scale you must run O log n. O n might work if you have the horsepower and a smallish data set but if possible you do O n stuff in advance, cache and run an O log n system. You need caching far more than you need compression as the intersection problem is On*Ologn. What this basically means is that it will never scale if you try to run it real time. You need to cheat. You will find a lot of useful stuff here and in the links.	[reply]
Re^8: Compress positive integers by tachyon-II (Chaplain) on Apr 11, 2008 at 04:21 UTC
Hi MimisIVI Your problem has been playing on my mind. You are actually talking about 2 problems. Data serialization and data compression. The data you want to store is in perl structure terms: `{ word1 => { doc_id1 => [ pos1 pos2...posN ], doc_id2 => [ pos1 pos2...posN ], word2 => { doc_id1 => [ pos1 pos2...posN ], doc_id2 => [ pos1 pos2...posN ],` [download] Thinking about the problem reveals. We need a 64 bit doc ID because at 32 bits we are limited to 4294967296 doc_ids but we don't require all 64 bits. 48 bits gives us 281,474,976,710,656 ie 281 thousand billion doc_ids. More on this later. Our pos numbers represents the position in the doucument. An average document is 5k so the average pos is 2.5k. 2500 is 100111000100 in binary and the gamma encoding of this is 00000000000100111000100 (the number is 100111000100 or 12 bits and we need 11 leading zeros so we need 23 bits for our average case). It is worth noting that once we pass 65535 (16 bits) gamma endoded integers occupy 32 or more bits ie it is MORE expensive space wise to store them gamma encoded than as a simple 4 byte int. If we were simply to use a 4 byte int we would need 32 bits so compression has an average case space saving of only 9 bits (assuming an average value of 2.5k) AND we have not done the serialisation required. While this saving is real it is only a saving of 25% so at the very best this can only improve performance by 25%. This is probably not worth the trouble of implementing as if it does not work 25% slower it will not work 25% faster. Although every little bit does add up this is not a big win order of magnitude saving area. If we said OK we will use 4 bytes we have 32 bits to play with. We need 1-2 bits for the serialisation to effectively take the places of the spaces and ; you are using when demonstration the data structure. `data => doc_id pos1 pos2...posN; doc_id2 pos1....; etc` [download] If we "steal" the 2 most significant bits (MSB) from our 32 bits we have 1,073,741,824 available values for pos (30 bits) which allows us to index docs up to a GB long. If we steal one bit we have 2,147,483,648. We have to draw the line somewhere and I would put it to you you probably only want to index say the first 2.5k and the last 2.5k of big documents (on the basis this should get you capture the introduction and conclusion). Anyway I suggest this is reasonable so your serialised data structure could use the MSB as a flag that says in effect the next 64 bits are a doc_id. Because we are choosing fixed width 4 byte ints every 4 bytes that follows is by definition a pos number unless it has the MSB set. The next doc_id is marked by the MSB flag. Is this what you are looking for?	[reply] [d/l] [select]
Re^9: Compress positive integers by MimisIVI (Acolyte) on Apr 11, 2008 at 15:29 UTC
Hi tachyon.., Actually yes, this is what i want to do..but i dont have the pack/unpack knowledge to do it...If we will forget the compress methods that i said can you give an example how to implement this posting lists structure?? About the positions that i save in my application.. For example if i have this text: "You can do what ever you want,if you dont think you cant" The positions of the word "you" are 1,6,9,12 I implement the GAmma and the Byte allign codes in perl and i have to say that the Gamma gives very compression results for the doc ids..because i dont save the real doc id for each document but the relevance diferences.. for example: Original Posting List only with doc ids: Post: 1 4 5 7 .... The diferences: Post: 1 3(4-1) 1(5-4) 2(7-5) .... i keep only for the first doc id the original id Actually i use the same method with the diferences for the positions too.. So for a term which has very high DF (which is my botleneck) the diferences are very small(average 2) and with Gamma code i use only 3 bits!!! The problem is that the decoding is very complex (as BrowserUK said too)and i dont know how to import efficient positions for each docid......i am using the substr function to read each bit from the bit string which i dont know if it is the most efficient way.. About the byte allign as i wrote in another thread is not so compressed as Gamma but the decoding is very simple ... I just tried to figure out if it is worth the decoding time for these methods...but i really want to try your proposal...if you can help me... Thanks again Guys..	[reply]
Re^10: Compress positive integers by MimisIVI (Acolyte) on Apr 11, 2008 at 16:20 UTC
About the spaces and ; that i use in my structure.. I use them to distinguish the doc ids and the positions in the ASCI string but if i will use binary representation of the structure and flags to separate them i think are uselless to keep them in the structure..	[reply]