Re^6: Compress positive integers
by tachyon-II (Chaplain)
on Apr 09, 2008 at 14:59 UTC ( [id://679265] )
Well, basically the answer is yes, I have done this, and so for that matter has BrowserUk. Not this exact task, but a very similar one. Let me clarify. In 2002-2003 BrowserUk and I worked on a project that pulled down the entire content of the then-extant internet (looks about the same size as your dataset :-), tokenised it into 1, 2 and 3 "word/phrase" tokens, applied Bayesian algorithms to the data (which involved finding the intersection of multi-gigabyte data sets), ran all the data munging algorithms in Perl using flat files, sorted and merged the results, and then dumped the results into an RDBMS at the end. We managed to reduce the initialisation runtime from greater than the known life of the universe to under 24 hours. This was done under the limitations of quad Xeon processors, 4 GB of RAM and 160 GB of disk, as I remember it.

In our system the final output was essentially a single huge RDBMS table. Our index (although bigger than the table) still ended up small enough to fit into 4 GB of RAM, and the systems that ran in production could manage 300+ queries per second in real time running Perl, MySQL, squid and Apache. It ran fast because the final query required very little work, and the result could be found straight from the index.

It's hard to explain to you how much optimisation was required. To get Linux to transfer data at 50 MB/sec you have to have fast disks, but you also need to give it the freedom to do its job. You need to specifically give your RDBMS the ability to hold its indexes in memory, and you must have the right indexes. You need to optimise your schema, you need to Benchmark, and you must find and remove the bottlenecks. There is no point in making fast code faster. You need to make slow code faster first (the bottleneck) and only then make fast code faster. Have a look at this for a start. On a typical system you can improve the performance of MySQL by AN ORDER OF MAGNITUDE (10x) by modifying the my.cnf file - without changing anything else.

Every time I thought "I need to use C to make this faster", either BrowserUk or I (usually him) came up with a solution (a better algorithm) that let us do it in Perl.

You keep saying "I want to do it this way". Well, here is the rub. You don't seem to do C/C++. You don't seem to understand bit string operations. You don't seem to understand how the overhead of transferring data to Perl when you don't need to is going to kill you. Most importantly, you don't seem to understand that no amount of optimisation will make a slow algorithm fast.

Yes, you can "pre-calculate" stuff using raw data and do it outside a DB. And yes, it can make the final queries fast. We did. Google do. For example, they don't try to query their index for the proximity of "George Bush" as two words every time someone types it in. They do it once, cache the result and then run off the cache, updating periodically. Caching is everywhere. The disks do it, the OS does it, the RDBMS does it, and all these caches can be optimised to the task at hand.

For example, if you do what we did and tokenise every page we could find on the internet, you will end up with around 100,000 "tokens". Not that many, but if you go to two-token pairs the worst case is 100,000^2, or 10 billion. That is of course a little less manageable; however, as it turns out, the number of actual pairs is considerably less, and if you ignore rare tokens, i.e. pairs that occur only a few times, the number falls into the millions. A rough sketch of that counting step follows.
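Just to make the pair counting concrete, something like this is all it takes in Perl (a minimal sketch with made-up documents and a made-up rarity threshold - not the code we actually ran):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of counting 2-word token pairs and dropping the rare ones.
    my %pair_count;

    sub count_pairs {
        my ($text) = @_;
        # crude tokenisation: lowercased runs of letters/apostrophes
        my @words = grep { length } map { lc } $text =~ /([a-z']+)/gi;
        for my $i ( 0 .. $#words - 1 ) {
            $pair_count{"$words[$i] $words[$i+1]"}++;
        }
    }

    count_pairs($_) for (
        "George Bush met George Bush",
        "the quick brown fox",
    );

    # Ignoring rare pairs is what collapses 10 billion possible
    # pairs down to a few million real ones.
    my $threshold = 2;
    for my $pair ( sort keys %pair_count ) {
        next if $pair_count{$pair} < $threshold;
        print "$pair => $pair_count{$pair}\n";    # "george bush => 2"
    }

In the real thing the counts were far too big for one hash in RAM, which is where the flat files, sorting and merging came in.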
With 3-word tokens you have to be a bit more brutal, but you still end up with "only" around 100 million tokens (3-word triplets). Using the data storage format I suggested you need 4 bytes x 3 to store this data, which gives you a table size of "only" 1.2 GB. FWIW, to do what you want you just need vec or the bitshift operators << and >> (a quick sketch follows). They won't help, but HTH anyway.
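Just so we are talking about the same thing, here is what that 12-byte record format looks like with pack, vec and the shift operators (a minimal sketch; the token IDs are invented):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Three 4-byte unsigned integers per triplet record.
    my @triplet = ( 17, 42_000, 3_000_000_000 );    # three token IDs

    # Pack them into a 12-byte record ('N' = 32-bit big-endian unsigned).
    my $record = pack 'N3', @triplet;
    printf "record is %d bytes\n", length $record;  # 12

    # Read them back.
    my @ids = unpack 'N3', $record;
    print "ids: @ids\n";

    # The same thing with vec - treat the string as an array of 32-bit ints.
    my $id1 = vec( $record, 1, 32 );                # second 4-byte slot
    print "second id via vec: $id1\n";              # 42000

    # And plain bitshifts, if you want to pick a 32-bit value apart.
    my $hi16 = $ids[2] >> 16;                       # top 16 bits
    my $lo16 = $ids[2] & 0xFFFF;                    # bottom 16 bits
    printf "%u = (%u << 16) | %u\n", $ids[2], $hi16, $lo16;

The point of vec is that you can slurp a fixed-width flat file into a scalar and treat it as one giant array of 32-bit integers without unpacking the lot.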
In Section: Seekers of Perl Wisdom