...since with the vec function I can decode 3,000,000 doc ids in 2 seconds and 10 million in 6 secs!!!) as the below code shows...

First off, using vec to pack 32-bit (i.e. byte, word and dword aligned) numbers is giving you a false impression of its performance. It's when you start crossing those boundaries that the performance falls off sharply. If you just want to pack 32-bit values on dword boundaries, pack 'V' (or 'N' if you're on a big-endian machine) is faster:

    C:\test>p1
    cmpthese -3, {
        pack_32bit => q[ my $packed = pack 'V*', 1 .. 1e6 ],
        vec_32bit  => q[ my $packed = ''; vec( $packed, $_, 32 ) = $_ for +1 .. 1e6 ]
    };;
               s/iter  vec_32bit pack_32bit
    vec_32bit    5.58         --       -12%
    pack_32bit   4.94        13%         --

But neither gives you the compression you seek.
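To make the "no compression" point concrete, here is a minimal sketch (the sample values are made up for illustration, not taken from your data). It also shows that vec with 32-bit elements lays its bytes out in big-endian order, the same as pack 'N*', whatever machine you run it on; either way every number costs exactly 4 bytes:

```perl
use strict;
use warnings;

# Hypothetical sample values, chosen only for illustration.
my @numbers = ( 1, 258, 70_000 );

# Build the string one 32-bit element at a time with vec (0-based offsets).
my $via_vec = '';
vec( $via_vec, $_, 32 ) = $numbers[$_] for 0 .. $#numbers;

# Build the equivalent fixed-width strings with pack.
my $via_pack_N = pack 'N*', @numbers;    # big-endian 32-bit
my $via_pack_V = pack 'V*', @numbers;    # little-endian 32-bit

printf "vec: %d bytes, pack N: %d bytes, pack V: %d bytes\n",
    length($via_vec), length($via_pack_N), length($via_pack_V);

print "vec output matches pack 'N*'\n" if $via_vec eq $via_pack_N;
```

Small numbers and large numbers take the same space here, which is why a variable-length encoding is the thing to look at.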

About your code for the Elias technique, I have to say that it is 3 times faster than mine

But it is still slower than my $packed = pack 'w*', @numbers; and achieves far less compression. pack 'w' (BER compressed integers) is built in, and gives the best compression and speed.
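As a rough sketch of what that looks like in practice (the doc ids below are invented for illustration, and storing the gaps between sorted ids is my own assumption about how you might use it, not something from your code):

```perl
use strict;
use warnings;

# Hypothetical sorted document ids, for illustration only.
my @doc_ids = ( 3, 130, 131, 20_000, 20_005, 3_000_000 );

my $fixed = pack 'V*', @doc_ids;    # always 4 bytes per id
my $ber   = pack 'w*', @doc_ids;    # 7 bits of payload per byte, as many bytes as needed

# Assumption: ids are sorted ascending, so we can store the gaps instead.
# Gaps are smaller numbers, and smaller numbers need fewer BER bytes.
my $prev = 0;
my @gaps = map { my $gap = $_ - $prev; $prev = $_; $gap } @doc_ids;
my $ber_gaps = pack 'w*', @gaps;

# Decoding is a single unpack; for gaps, a running sum restores the ids.
my @decoded  = unpack 'w*', $ber;
my $running  = 0;
my @restored = map { $running += $_ } unpack 'w*', $ber_gaps;

printf "fixed: %d bytes, BER: %d bytes, BER of gaps: %d bytes\n",
    length($fixed), length($ber), length($ber_gaps);
```

Note that 'w' only handles non-negative integers, which is fine for doc ids and for gaps between ascending ids.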

For the SQL command that you propose, I want to ask which server it is appropriate for, because MySQL has no command for intersection (I tried some inner joins, but the performance was very, very slow for a 1GB dataset: 250000 pages, average document length 602 words, number of unique words 907806)...

I gave up on MySQL a long time ago because of its limitations. It has improved markedly in recent versions, allowing subselects in places where it never used to and adding stored procedures and the like, but I still prefer Postgres. In particular, the pgAdmin III tool is excellent for tuning your queries.

I'll try building a DB to match those numbers and let you know how the performance pans out. But even if it were 50 times slower than with (5000/554/15000), which it won't be, it would still be 100 times faster than having to decompress 25 times as much data as you need and then select the 4% you do need in Perl.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^4: Do you really want to use an array there? by BrowserUk
in thread Do you really want to use an array there? by deprecated
