PerlMonks  

Re^6: Compress positive integers

by tachyon-II (Chaplain)
on Apr 09, 2008 at 14:59 UTC ( [id://679265] )


in reply to Re^5: Compress positive integers
in thread Compress positive integers

Well, basically the answer is yes, I have done this, and so for that matter has BrowserUk. Not this exact task, but a very similar one.

Let me clarify. In 2002-2003 BrowserUk and I worked on a project that pulled down the entire content of the then-existent internet (which looks about the same size as your dataset :-), tokenised it into 1, 2 and 3 "word/phrase" tokens, applied Bayesian algorithms to the data (which involved finding the intersection of multi-gigabyte data sets), ran all the data munging algorithms in Perl using flat files, sorted and merged the results, and then dumped the results into an RDBMS at the end. We managed to reduce the initialisation runtime from greater than the known life of the universe to under 24 hours. This was done under the limitations of quad Xeon processors, 4 GB of RAM and 160 GB of disk, as I remember it.

In our system the final output was essentially a single huge RDBMS table. Our index (although bigger than the table) still ended up being small enough to fit into 4 GB of RAM and the systems that ran in production could manage 300+ queries per second in real time running perl, mysql, squid and apache. It ran fast because the final query required very little work, and the result could be found straight from the index.

It's hard to explain to you how much optimisation was required. To get Linux to transfer data at 50 MB/sec you have to have fast disks, but you also need to give it the freedom to do its job. You need to specifically give your RDBMS the ability to hold its indexes in memory, and you must have the right indexes. You need to optimise your schema, you need to benchmark, and you must find and remove the bottlenecks. There is no point in making fast code faster. You need to make the slow code (the bottleneck) faster first, and then make fast code faster. Have a look at this for a start. On a typical system you can improve the performance of MySQL by AN ORDER OF MAGNITUDE (10x) by modifying the my.cnf file, without changing anything else.
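For example, the handful of my.cnf settings below are usually the ones that matter most. The values are purely illustrative assumptions, not a recipe; benchmark against your own RAM and workload:

[mysqld]
# Illustrative values only -- tune to your own hardware and benchmark.
key_buffer_size  = 512M   # MyISAM index blocks cached in RAM
query_cache_size = 64M    # cache results of identical SELECTs
table_cache      = 512    # open table handles kept around
sort_buffer_size = 4M     # per-connection sort area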

Every time I thought I needed to use C to make it faster, either BrowserUk or I (usually him) came up with a solution (a better algorithm) that let us do it in Perl.

You keep saying "I want to do it this way". Well, here is the rub. You don't seem to do C/C++. You don't seem to understand bit string operations. You don't seem to understand how the overhead of transferring data to Perl when you don't need to is going to kill you. Most importantly, you don't seem to understand that no amount of optimisation will make a slow algorithm fast.

Yes, you can "pre-calculate" stuff using raw data and do it outside a DB. And yes, it can make the final queries fast. We did. Google do. For example, they don't try to query their index for the proximity of "George Bush" as two words every time someone types it in. They do it once, cache the result and then run off the cache, updating periodically. Caching is everywhere. The disks do it, the OS does it, the RDBMS does it, and all these caches can be optimised to the task at hand.

For example, if you do what we did and tokenise every page you can find on the internet, you will end up with around 100,000 "tokens". Not that many, but if you go to two-token pairs the worst case is 100,000^2 or 10 billion. This is of course a little less manageable; however, as it turns out, the number of actual pairs is considerably less, and if you ignore rare tokens, i.e. pairs that occur fewer than a few times, the number falls into the millions. With 3-word tokens you have to be a bit more brutal, but you still end up with "only" around 100 million tokens (3-word triplets). Using the data storage format I suggested you need 4 bytes x 3 to store this data, which gives you a table size of "only" 1.2 GB.

FWIW, to do what you want you just need vec or the bit-shift operators << and >>. They won't help with the real problem, but HTH anyway.
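For what they look like in practice, here is a bare-bones sketch (the values are just illustration):

use strict;
use warnings;

# vec() treats a scalar as a flat string of fixed-width slots.
my $bits = '';
vec( $bits, 0, 32 ) = 2500;          # store 2500 in the first 32-bit slot
vec( $bits, 1, 32 ) = 42;            # and 42 in the second
print vec( $bits, 0, 32 ), "\n";     # 2500
print length($bits), " bytes\n";     # 8 bytes -- two 32-bit slots

# The shift operators just move bits left and right.
my $n = 2500;
print $n << 2, "\n";                 # 10000 (multiply by 4)
print $n >> 2, "\n";                 # 625   (integer divide by 4)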

Replies are listed 'Best First'.
Re^7: Compress positive integers
by MimisIVI (Acolyte) on Apr 09, 2008 at 16:32 UTC
    Hi tachyon-II one more time,


    First, I want to thank you for taking the time to explain these things, which are very interesting to me...

    I am not afraid to say that you are right about the things you said about me, but because IR is a field I love and want to work in my whole life, I have the courage to learn everything I need to become an expert in it.. So comments like yours make me try harder and harder.. thanks for that..

    About the indexes that I tried with MySQL, I have to say they were completely on disk, even the lexicon (I didn't even use the MySQL query cache).. Maybe that was stupid, but I am trying to find a schema which is fast and small and can work on any PC without system optimisations.. Maybe that is stupid too, but it's something I like to research, so I don't mind spending the time..


    About your comment on my algorithm, do you mean my index schema or the Elias gamma code?? I suppose both.. :) I want to ask if you tried this schema in your project (with the long posting lists..)?? I suppose yes...

    One thing that I want to hear from you guys is your opinion about the Lucene index schema.. I am trying to figure out what is going on inside that indexing tool, but my knowledge of low-level operations in Java and Perl (for the Perl ports like KinoSearch) is zero..

    Thanks again for your patience ..

    Cheers..

      I am just a hacker. Most of what I know I learnt from trying this, testing that, trying something else.

      The rough schema suggested will sort of work at scale. It should be considerably faster than how you are trying to do it currently but still has issues.

      The storage requirement is OK. The BIG table will take 4 bytes x 3 per record (3 ints). In rough terms your table will be around 240% of your total document size, assuming a 4-byte average word plus one byte for the space, and pure ASCII documents. It will be less than this if the documents are .doc, .pdf, .htm etc., as the textual content is less and that is what you use to index. Believe it or not (given your compression focus), it might be better to put in one "useless" INT to pad each record out to 16 bytes. This is because 16 bytes is a neat fit into 256/512/1024... size disk sectors. Even if you don't pad, your RDBMS may well add this padding for you; it depends on the RDBMS. Will that make a difference? Dunno. Don't take my word for it. Make two big tables, with 12 and 16 byte records, and benchmark it. It used to make a difference. It may only speed up the creation of the index (but for a big table this can take a long time). Given that everything rotates around access to this table, you need to get it optimised as best you can first.
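      If you want a starting point for that benchmark, something along these lines would do. The DSN, credentials and table names are made up; adjust to your own setup and treat it as a sketch, not a recipe:

      use strict;
      use warnings;
      use DBI;
      use Benchmark qw(timethese);

      # Hypothetical connection details -- change to match your server.
      my $dbh = DBI->connect( 'dbi:mysql:database=test', 'user', 'pass',
                              { RaiseError => 1 } );

      # Same data, two record widths: 3 INTs (12 bytes) vs 4 INTs (16 bytes).
      $dbh->do('CREATE TABLE IF NOT EXISTS idx12 (word INT, doc INT, pos INT)');
      $dbh->do('CREATE TABLE IF NOT EXISTS idx16 (word INT, doc INT, pos INT, pad INT)');

      my $ins12 = $dbh->prepare('INSERT INTO idx12 VALUES (?,?,?)');
      my $ins16 = $dbh->prepare('INSERT INTO idx16 VALUES (?,?,?,0)');

      timethese( 100_000, {
          '12 byte' => sub { $ins12->execute( int rand 1e5, int rand 1e6, int rand 2500 ) },
          '16 byte' => sub { $ins16->execute( int rand 1e5, int rand 1e6, int rand 2500 ) },
      });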

      Anyway, the table will be big BUT.... that does not really matter. The reason is indexing.

      The simplest way to understand an index (and the whole O(n) vs O(log n) thing) is to recall the guess-a-number game: guess a number between 0 and 128. You guess 64, i.e. (0+128)/2. Higher. You guess 96, (64+128)/2. Higher. You guess 112, (96+128)/2. Higher. You guess 120. Higher. You guess 124. Higher. You guess 126. Higher. You guess 127. Higher. 128!!!! As opposed to getting the answer in, on average, 64 tries, we got it in 7. Why 7? int(log2 128) = 7. This is actually the worst case, and the average case is 5-6 guesses, so it is somewhat academic. If you just started at 0 and went up, or started at 128 and went down, on average we would take 64 guesses to get the right answer.

      So iteration is 0.5 O(n), and the binary search algorithm is O(log2 n). Why is this good? Well, if we go to guess a number between 1 and 1,000,000, the iterative solution takes 500,000 tries on average (1,000,000 worst case), but the binary search takes only 20 tries (worst case). Get it? O(n) is bad at scale. O(log n) basically does not care about scale. Binary searches are like indexes. In essence, indexes make O(n) operations O(log n). Indexes cost space and time to calculate, but once you have them calculated you turn a 500,000 time unit operation into a < 20 time unit operation! O(log n) scales. O(n) scales poorly. O(n^2) dies at scale.
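      If you want to see that in code, a bare-bones binary search in Perl is just the guess-a-number game over a sorted array (a sketch, nothing more):

      use strict;
      use warnings;

      # Return the index of $target in the sorted array @$aref, or undef.
      sub binary_search {
          my ( $aref, $target ) = @_;
          my ( $lo, $hi ) = ( 0, $#$aref );
          my $guesses = 0;
          while ( $lo <= $hi ) {
              $guesses++;
              my $mid = int( ( $lo + $hi ) / 2 );
              if    ( $aref->[$mid] < $target ) { $lo = $mid + 1 }
              elsif ( $aref->[$mid] > $target ) { $hi = $mid - 1 }
              else  { print "found in $guesses guesses\n"; return $mid }
          }
          return undef;
      }

      my @sorted = ( 1 .. 1_000_000 );
      binary_search( \@sorted, 987_654 );   # ~20 guesses, not ~500,000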

      To scale you must run O(log n). O(n) might work if you have the horsepower and a smallish data set, but if possible you do the O(n) stuff in advance, cache it, and run an O(log n) system. You need caching far more than you need compression, as the intersection problem is O(n) lookups of O(log n) each, i.e. O(n log n). What this basically means is that it will never scale if you try to run it in real time. You need to cheat.

      You will find a lot of useful stuff here and in the links.

      Hi MimisIVI

      Your problem has been playing on my mind. You are actually talking about 2 problems. Data serialization and data compression.

      The data you want to store is in perl structure terms:

      {
        word1 => {
          doc_id1 => [ pos1, pos2, ... posN ],
          doc_id2 => [ pos1, pos2, ... posN ],
        },
        word2 => {
          doc_id1 => [ pos1, pos2, ... posN ],
          doc_id2 => [ pos1, pos2, ... posN ],
        },
        ...
      }

      Thinking about the problem reveals a few things. We need a 64-bit doc ID because at 32 bits we are limited to 4,294,967,296 doc_ids, but we don't require all 64 bits. 48 bits gives us 281,474,976,710,656, i.e. 281 thousand billion doc_ids. More on this later. Our pos numbers represent the position in the document. An average document is 5k, so the average pos is 2.5k. 2500 is 100111000100 in binary, and the gamma encoding of this is 00000000000100111000100 (the number is 100111000100, or 12 bits, and we need 11 leading zeros, so we need 23 bits for our average case). It is worth noting that once we pass 65535 (16 bits), gamma-encoded integers occupy 32 or more bits, i.e. it is MORE expensive space-wise to store them gamma encoded than as a simple 4-byte int.
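      For reference, a straightforward (and unoptimised) Elias gamma encoder/decoder in Perl looks something like this, just to make the bit counts above concrete (the sub names are made up for illustration):

      use strict;
      use warnings;

      # Elias gamma: N-1 zero bits, then the N-bit binary form of the number.
      sub gamma_encode {
          my $n   = shift;                      # positive integer
          my $bin = sprintf '%b', $n;           # e.g. 2500 -> '100111000100'
          return ( '0' x ( length($bin) - 1 ) ) . $bin;
      }

      sub gamma_decode {
          my $bits = shift;
          my ($zeros) = $bits =~ /^(0*)/;       # count the leading zeros
          my $len = length($zeros) + 1;
          return oct( '0b' . substr( $bits, length($zeros), $len ) );
      }

      my $code = gamma_encode(2500);
      printf "%s (%d bits)\n", $code, length $code;   # 23 bits, as above
      print gamma_decode($code), "\n";                # 2500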

      If we were simply to use a 4-byte int we would need 32 bits, so compression has an average-case space saving of only 9 bits (assuming an average value of 2.5k), AND we have not done the serialisation required.

      While this saving is real, it is only a saving of roughly 25%, so at the very best this can only improve performance by about 25%. This is probably not worth the trouble of implementing: if the system does not work now, it will not work at 25% faster either. Although every little bit does add up, this is not a big-win, order-of-magnitude saving area.

      If we say OK, we will use 4 bytes, then we have 32 bits to play with. We need 1-2 bits for the serialisation, to effectively take the place of the spaces and the ; you are using when demonstrating the data structure:

      data => doc_id pos1 pos2...posN; doc_id2 pos1....; etc

      If we "steal" the 2 most significant bits (MSB) from our 32 bits, we have 1,073,741,824 available values for pos (30 bits), which allows us to index docs up to a GB long. If we steal one bit, we have 2,147,483,648. We have to draw the line somewhere, and I would put it to you that you probably only want to index, say, the first 2.5k and the last 2.5k of big documents (on the basis that this should capture the introduction and conclusion).

      Anyway, I suggest this is reasonable, so your serialised data structure could use the MSB as a flag that says, in effect, "the next 64 bits are a doc_id". Because we are choosing fixed-width 4-byte ints, every 4 bytes that follows is by definition a pos number unless it has the MSB set. The next doc_id is marked by the MSB flag.
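      A minimal sketch of that layout with pack/unpack, simplified so a doc_id fits in one flagged 32-bit int rather than the 64 bits described above (the sub names are made up):

      use strict;
      use warnings;

      my $FLAG = 0x80000000;    # most significant bit of a 32-bit int

      # Serialise { doc_id => [ positions ] } into fixed-width 4-byte ints.
      # Doc ids carry the MSB flag, positions do not (both assumed < 2**31).
      sub serialise {
          my $postings = shift;
          my @ints;
          for my $doc_id ( sort { $a <=> $b } keys %$postings ) {
              push @ints, $doc_id | $FLAG, @{ $postings->{$doc_id} };
          }
          return pack 'N*', @ints;    # big-endian 32-bit ints
      }

      sub deserialise {
          my $blob = shift;
          my ( %postings, $current );
          for my $int ( unpack 'N*', $blob ) {
              if ( $int & $FLAG ) { $current = $int & ~$FLAG; $postings{$current} = [] }
              else                { push @{ $postings{$current} }, $int }
          }
          return \%postings;
      }

      my $blob = serialise( { 7 => [ 1, 6, 9, 12 ], 42 => [ 3, 2500 ] } );
      print length($blob), " bytes\n";   # 4 bytes per stored int
      my $back = deserialise($blob);
      print "@{ $back->{7} }\n";         # 1 6 9 12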

      Is this what you are looking for?

        Hi tachyon..,

        Actually yes, this is what I want to do, but I don't have the pack/unpack knowledge to do it... If we forget the compression methods I mentioned, can you give an example of how to implement this posting list structure??

        About the positions that I save in my application..
        For example, if I have this text:
        "You can do what ever you want, if you dont think you cant"
        the positions of the word "you" are 1, 6, 9, 12.

        I implemented the Gamma and the byte-aligned codes in Perl, and I have to say that Gamma gives very good compression results for the doc ids, because I don't save the real doc id for each document but the differences between consecutive ids..

        For example, an original posting list with doc ids only:
        Post: 1 4 5 7 ....
        becomes, as differences:
        Post: 1 3(4-1) 1(5-4) 2(7-5) .... (I keep the original id only for the first doc id.)
        Actually I use the same method with the differences for the positions too..
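        In Perl the gap encoding itself is only a couple of lines (just a sketch):

        use strict;
        use warnings;

        my @doc_ids = ( 1, 4, 5, 7 );

        # Store the first id, then the gap to the previous id.
        my @gaps = ( $doc_ids[0], map { $doc_ids[$_] - $doc_ids[ $_ - 1 ] } 1 .. $#doc_ids );
        print "@gaps\n";      # 1 3 1 2

        # Reconstruct by keeping a running total.
        my ( $prev, @back ) = ( 0 );
        push @back, $prev += $_ for @gaps;
        print "@back\n";      # 1 4 5 7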

        So for a term which has a very high DF (which is my bottleneck) the differences are very small (2 on average), and with the Gamma code I use only 3 bits!!! The problem is that the decoding is very complex (as BrowserUK said too) and I don't know how to include the positions for each doc id efficiently...... I am using the substr function to read each bit from the bit string, which I don't know if it is the most efficient way..

        About the byte-aligned code, as I wrote in another thread, it does not compress as well as Gamma, but the decoding is very simple ...
        I just tried to figure out whether these methods are worth the decoding time... but I really want to try your proposal... if you can help me...

        Thanks again Guys..
