Re^3: Compress positive integers

Replies are listed 'Best First'.
Re^4: Compress positive integers by tachyon-II (Chaplain) on Apr 09, 2008 at 00:35 UTC
You don't seem to be listening. What you are trying to do will never work at scale. You need to do all the heavy lifting on the RDBMS side. The heart of the problem is this: `column Posting List ==> "10 20;3 21 2; 43 100;105;......` [download] The core problem with this format is that you MUST pull it back into perl to manipulate it. The more documents you have and the more words per document the more data you need to pull to process your query. This will never scale. Not today, not tomorrow, not never. Compression will only vaguely paper over the cracks, if it helps at all (because of the overhead price). The table format suggested by BrowerUk is not optimal but is along the lines of what you want as it indicates the need for 3 basic tables. For best speed you want to avoid BLOB and VARCHAR ie anything that is not fixed width as this makes indexing much harder and it is the indexing that is going to provide the speed/scalability in the same way a hash allows you to have fast "random" access to a large array. The basic unit of data you want to store consists of `WORD IN_DOCUMENT AT_POSITION` [download] To efficiently store this you will want to store this unit of data in a compressed form. The compressed form you really want is one where the WORD becomes a WORD_ID and the IN_DOCUMENT becomes a DOC_ID. If you think about a search you want to input words to search for, thus the words table needs to rapidly convert a WORD to a WORD_ID. Using this we will look in our basic unit data table and get back some DOC_IDs which we then need to convert back to IN_DOCUMENT. Thus the tables you need are: `TABLE words word CHAR(32) PRIMARY KEY word_id INT TABLE docs doc_id INT PRIMARY KEY doc_name VARCHAR TABLE word_doc word_id INT doc_id INT pos INT` [download] To complete this we must create an index on the word_doc table. This table as it stands is just a big array. When we create an index on it it effectively becomes a big hash that we can get rapid random access to `CREATE INDEX warpspeed ON word_doc (word_id)` [download] With this structure in place you can then make a query into the word_doc table that will return the data you want. The basic query will be: `SELECT doc_id FROM word_doc WHERE word_id = ? INTERSECT SELECT doc_id FROM word_doc WHERE word_id = ?` [download] This will work but will still be slow. The problem is that it is a big task and we can't use the index to go straight to the data we want. The solution to this is to predict and cache.	[reply] [d/l] [select]
Re^5: Compress positive integers by MimisIVI (Acolyte) on Apr 09, 2008 at 13:19 UTC
Hi tachyon-II, First of all i have a question to you... Have you ever try to do these things that you say in practice??? Have you ever try to do the intersection that you propose to me in a table with 2000000000 records??? If you did it and the perfomance was sutisfied , i cant say anything... but because before end up in this shema i spend 4 months tried all these shema that you said with MySQL and the results wasnt sutisfied...(i read all the optimizations tips about MySQl query and believe me the perfomance wasnt good) i can say for sure that the shema with the posting lists is the best for large documents collections...(by the way Lucene somehow has the same index shema too...) NOw about the tables that you propose, the first is the very famous LEXICON and offcourse i use it in my application... The same happends with the second one where i save the path or the adress of the each document that i am indexing... Now one thing that i want to test is to use the file system to create my index instead of the DBMS(SQL SERVER 2005) but i dont know if i will be efficient to use in binmode files to save all the required info that i need.. One last question for me is how can i read from a bit string one bit per time... i fetch from the DBMS a binary compressed string with all the docids for one term.. To decode the string i use the below code wich is extremely fast... `my $decode = unpack "b",$compressed;` [download] But because i need to read from the string a bit per time and the below code is very slow ... `my @bits = split(//, unpack("b", $compressed));` i want to ask if there is anyway to read from the compressed or the decode string a bit per time... Thanks for your Time.. Mimis	[reply] [d/l] [select]
Re^6: Compress positive integers by tachyon-II (Chaplain) on Apr 09, 2008 at 14:59 UTC
Well basically the answer is yes I have done this, and so for that matter has BrowserUk. Not this exact task but a very similar one. Let me clarify. In 2002-2003 BrowserUk and I worked on a project that pulled down the entire content of the then existent internet (looks about the same size as your dataset :-), tokenised it into 1, 2 and 3 "word/phrase" tokens, applied Bayesian algorithms to the data (which involved finding the intersection of multigigabyte data sets), ran all the data munging algorithms in perl using flat files, sorted and merged the results, and then dumped the results into a RDBMS at the end. We managed to reduce the initialisation runtime from greater than the known life of the universe to under 24 hours. This was done under the limitations of using quad Xeon processors, 4 GB of RAM and 160 GB of disk as I remember it. In our system the final output was essentially a single huge RDBMS table. Our index (although bigger than the table) still ended up being small enough to fit into 4 GB of RAM and the systems that ran in production could manage 300+ queries per second in real time running perl, mysql, squid and apache. It ran fast because the final query required very little work, and the result could be found straight from the index. Its hard to explain to you how much optimisation was required. To get linux to transfer data at 50 MB/sec you have to have fast disks but you also need to give it the freedom to do its job. You need to specifically give your RDBMS the ability to hold its indexes in memory and you must have the right indexes. You need to optimise your schema, you need to Benchmark and you must find and remove the bottlenecks. There is no point in making fast code faster. You need to make slow code faster first (bottleneck) and then make fast code faster. Have a look at this for a start. On a typical system you can improve the performance of mySQL by AN ORDER OF MAGNITUDE (10x) by modifying the my.cnf file - without changing anything else. Every time I though I need to use C to make it faster either BrowserUk or I (usually him) came up with a solution (better algorithm) that let us do it in Perl. You keep saying I want to do it this way. Well here is the rub. You don't seem to do C/C++. You don't seem to understand bit string operations. You don't seem to understand how the overhead of transfering data to perl when you don't need to is going to kill you. Most importantly you don't seem to understand that no amount of optimisation will make a slow algorithm fast. Yes you can "pre-calculate" stuff using raw data and do it outside a DB. And yes it can make the final queries fast. We did. Google do. For example they don't try to query their index for proximity of "George Bush" as two words everytime someone types it in. They do it once, cache the result and then run off cache, updating periodically. Caching is everywhere. The disks do it, to OS does it, the RDBMS does it and all these caches can be optimised to the task at hand. For example if you do what we did and tokenise every page we could find on the internet you will end up with around 100,000 "tokens". Not that many but if you go to two token pairs the worst case is 100,000^2 or 10 billion. This is of course a little less manageable however as it turns out the number of actual pairs is considerably less and if you ignore rare tokens ie pairs that happen less than a few times the number falls into the millions. With 3 word tokens you have to be a bit more brutal but you still end up with "only" around 100 million tokens (3 word triplets). Using the data storage format I suggested you need 4 bytes x 3 to store this data which gives you a table size of "only" 1.2 GB. FWIW to do what you want you just need vec or the bitshift operators. << and >>. They won't help but HTH anyway.	[reply]
Re^7: Compress positive integers by MimisIVI (Acolyte) on Apr 09, 2008 at 16:32 UTC
Re^8: Compress positive integers by tachyon-II (Chaplain) on Apr 09, 2008 at 17:38 UTC
Re^8: Compress positive integers by tachyon-II (Chaplain) on Apr 11, 2008 at 04:21 UTC
Some notes below your chosen depth have not been shown here
Re^4: Compress positive integers by BrowserUk (Patriarch) on Apr 09, 2008 at 19:07 UTC
The problem with the Elias gamma encoding is that even if you implement the bit-twiddling in C, because the fields span byte boundaries, the process of compressing and decompressing your posting lists is slower than unpacking your ASCII representation using pack/unpack. So, whilst you might achieve a few percent more compression than with the simple binary packing I showed elsewhere, the time saved in reading and transmitting that slightly smaller volume of data, is negated by the time it takes to decompress it. The algorithm you posted in your OP, that is taken from the Buettcher paper seeks to address that problem, by only packing to whole byte boundaries. The problem with it is that the code show in the paper for compression and decompression is uselessly incomplete. Eg. Compression code given: `int outPos = 0, previous = 0; for (int inPos = 0; inPos < n; inPos++) { int delta = uncompressed[inPos] - previous; while (delta >= 128) { compressed[outPos++] = (delta & 127) \| 128; delta = delta >> 7; } compressed[outPos++] = delta; }` [download] Note: `n` in the for loop is never set. This can be derived from the length of the input. However, notice that `previous` is initialised to 0, and subtracted from each input number, but it is never set to any value other than zero! Decompression code given: `int outPos = 0, previous = 0; for (int outPos = 0; outPos < n; outPos++) { for (int shift = 0; ; shift += 7) { int temp = compressed[inPos++]; previous += ((temp & 127) << shift); if (temp < 128) break; } uncompressed[outPos] = previous; }` [download] Note: Again, `n` is never set. But this time, you cannot derive it from the input. The input is a bit-stream--ie. a string of bytes--but you cannot use `strlen` as the string will contain embed nulls. So, one algorithm is costs more than it saves and the other that seeks to address that is incomplete, and would probably still cost too much if implemented in Perl. The major costs of your schema are: the volume of data you are transferring for each 'hit'. The time taken to unpack that data. The time taken to sub-select the small number of relevant sets of document offsets from the large number you are retrieving. Each time, you are having to unpack/decompress the document offsets for every document that contains a particular word, in order to gain access to just those that match your complete query. Using your numbers: 5,000 docs * 554 words/doc / 15,000 unique words = each word appears in (on average) 185 documents. That means you read, transfer and unpacked/decompressed all the offsets for 185 documents for each word . But probabilistically, for any given 2 word query, you only need the data for 7 documents. So you've had to handle the data for 370 documents, but then discard it for 356 of them. If you went with the schema I presented, you only read, transfer and unpack/decompress the data for those 7 documents that contain both words. That's 14/370*100 = less than 4% of the data to be manipulated. That saving will far exceed any addition compression you might get from using any of the methods you've mooted. Or any other compression mechanism. I don't think that breaking the schema down beyond the word-doc-offsets (plural) I suggested, to word-doc-offset (singular), will help much. You still need to transfer all the offsets for each word-doc pairing anyway, and the explosion in the size of the indexes will probably slow things. I think tachyon's point about variable length fields is a good one, except that it only affects the performance of indexes, if the variable length data is a part of the index. In the schema I outlined this is not the case. But, I'm not a DBA. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]