in reply to Re^3: Compress positive integers
in thread Compress positive integers
You don't seem to be listening. What you are trying to do will never work at scale. You need to do all the heavy lifting on the RDBMS side. The heart of the problem is this:
column Posting List ==> "10 20;3 21 2; 43 100;105;......
The core problem with this format is that you MUST pull it back into perl to manipulate it. The more documents you have and the more words per document the more data you need to pull to process your query. This will never scale. Not today, not tomorrow, not never. Compression will only vaguely paper over the cracks, if it helps at all (because of the overhead price).
The table format suggested by BrowerUk is not optimal but is along the lines of what you want as it indicates the need for 3 basic tables. For best speed you want to avoid BLOB and VARCHAR ie anything that is not fixed width as this makes indexing much harder and it is the indexing that is going to provide the speed/scalability in the same way a hash allows you to have fast "random" access to a large array.
The basic unit of data you want to store consists of
WORD IN_DOCUMENT AT_POSITION
To efficiently store this you will want to store this unit of data in a compressed form. The compressed form you really want is one where the WORD becomes a WORD_ID and the IN_DOCUMENT becomes a DOC_ID. If you think about a search you want to input words to search for, thus the words table needs to rapidly convert a WORD to a WORD_ID. Using this we will look in our basic unit data table and get back some DOC_IDs which we then need to convert back to IN_DOCUMENT. Thus the tables you need are:
TABLE words word CHAR(32) PRIMARY KEY word_id INT TABLE docs doc_id INT PRIMARY KEY doc_name VARCHAR TABLE word_doc word_id INT doc_id INT pos INT
To complete this we must create an index on the word_doc table. This table as it stands is just a big array. When we create an index on it it effectively becomes a big hash that we can get rapid random access to
CREATE INDEX warpspeed ON word_doc (word_id)
With this structure in place you can then make a query into the word_doc table that will return the data you want. The basic query will be:
SELECT doc_id FROM word_doc WHERE word_id = ? INTERSECT SELECT doc_id FROM word_doc WHERE word_id = ?
This will work but will still be slow. The problem is that it is a big task and we can't use the index to go straight to the data we want. The solution to this is to predict and cache.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^5: Compress positive integers
by MimisIVI (Acolyte) on Apr 09, 2008 at 13:19 UTC | |
by tachyon-II (Chaplain) on Apr 09, 2008 at 14:59 UTC | |
by MimisIVI (Acolyte) on Apr 09, 2008 at 16:32 UTC | |
by tachyon-II (Chaplain) on Apr 09, 2008 at 17:38 UTC | |
by tachyon-II (Chaplain) on Apr 11, 2008 at 04:21 UTC | |
by MimisIVI (Acolyte) on Apr 11, 2008 at 15:29 UTC | |
|