I am just a hacker. Most of what I know I learnt from trying this, testing that, trying something else.

The rough schema suggested will sort of work at scale. It should be considerably faster than how you are trying to do it currently but still has issues.

The storage requirement is OK. The BIG table will take 4 bytes x 3 per record (3 ints). In rough terms your table will be around 240% the size of you total document size assuming 4 bytes per word average plus one byte for the space and pure ASCII documents. It will be less than this if the documents are .doc, .pdf, .htm etc as the textual content is less and that is what you use to index. Believe it or not (given your compression focus) it might be better to put in one "useless" INT to pad each record out to 16 bytes. This is becuase 16 bytes is a neat fit into 256/512/1024.... size disk sectors. Even if you don't pad your RDBMS may well add this padding for you - depends on the RDBMS. Will that make a difference? Dunno? Don't take my word for it. Make two big tables, with 12 and 16 byte records and benchmark it. It did used to make a difference. It may only speed the creation of the index (but for a big table this can take a long time). Given that everything rotates around access to this table you need to get it optimised as best you can first.

Anyway, the table will be big BUT .... it that does not really matter. The reason is indexing.

The simplest way to understand an index (and the whole O n O log n thing) is to recall the guess a number game: Guess a number between 0 and 128. You guess 64 (0+128)/2. Higher. You guess 96 (64+128)/2. Higher. You guess 112 (96+128)/2. Higher. You guess 120. Higher. You guess 124. Higher. You guess 126. Higher. You guess 127. Higher. 128!!!!. As opposed to getting the answer in (on average 64 tries) we got it in 7. Why 7. int(log2 128) = 7. This is actually the worst case, but the average case is 5-6 guesses so it is somewhat academic. If you just started at 0 and went up, or started at 128 and went down, on average we would take 64 guesses to get the right answer.

So iteration is 0.5 O n, the binary search algoritm is O log(2) n. Why is this good? Well if we go to guess a number between 1 and 1,000,000 the iterative solution takes 500,000 tries on average (1,000,000 worst case) but the binary search takes only 19 tries (worst case). Get it? O n is bad at scale. O log n basically does not care about scale. Binary searches are like indexes. In essence indexes make O n operations O log n. Indexes cost space and time to calculate but once you have them calculated you turn a 500,000 time unit operation into a < 20 time unit operation! O log n scales. O n scales poorly. O n^2 dies at scale.

To scale you must run O log n. O n might work if you have the horsepower and a smallish data set but if possible you do O n stuff in advance, cache and run an O log n system. You need caching far more than you need compression as the intersection problem is On*Ologn. What this basically means is that it will never scale if you try to run it real time. You need to cheat.

You will find a lot of useful stuff here and in the links.


In reply to Re^8: Compress positive integers by tachyon-II
in thread Compress positive integers by MimisIVI

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.