comment on

You don't seem to be listening. What you are trying to do will never work at scale. You need to do all the heavy lifting on the RDBMS side. The heart of the problem is this:

column Posting List ==> "10 20;3 21 2; 43 100;105;......
[download]

The core problem with this format is that you MUST pull it back into perl to manipulate it. The more documents you have and the more words per document the more data you need to pull to process your query. This will never scale. Not today, not tomorrow, not never. Compression will only vaguely paper over the cracks, if it helps at all (because of the overhead price).

The table format suggested by BrowerUk is not optimal but is along the lines of what you want as it indicates the need for 3 basic tables. For best speed you want to avoid BLOB and VARCHAR ie anything that is not fixed width as this makes indexing much harder and it is the indexing that is going to provide the speed/scalability in the same way a hash allows you to have fast "random" access to a large array.

The basic unit of data you want to store consists of

WORD
IN_DOCUMENT
AT_POSITION
[download]

To efficiently store this you will want to store this unit of data in a compressed form. The compressed form you really want is one where the WORD becomes a WORD_ID and the IN_DOCUMENT becomes a DOC_ID. If you think about a search you want to input words to search for, thus the words table needs to rapidly convert a WORD to a WORD_ID. Using this we will look in our basic unit data table and get back some DOC_IDs which we then need to convert back to IN_DOCUMENT. Thus the tables you need are:

TABLE words
    word    CHAR(32) PRIMARY KEY
    word_id INT

TABLE docs
    doc_id   INT PRIMARY KEY
    doc_name VARCHAR

TABLE word_doc
    word_id  INT
    doc_id   INT
    pos      INT
[download]

To complete this we must create an index on the word_doc table. This table as it stands is just a big array. When we create an index on it it effectively becomes a big hash that we can get rapid random access to

CREATE INDEX warpspeed ON word_doc (word_id)
[download]

With this structure in place you can then make a query into the word_doc table that will return the data you want. The basic query will be:

SELECT doc_id FROM word_doc WHERE word_id = ?
INTERSECT
SELECT doc_id FROM word_doc WHERE word_id = ?
[download]

This will work but will still be slow. The problem is that it is a big task and we can't use the index to go straight to the data we want. The solution to this is to predict and cache.

In reply to Re^4: Compress positive integers by tachyon-II
in thread Compress positive integers by MimisIVI

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.