in reply to Do you really want to use an array there?
Re^2: Do you really want to use an array there?
by BrowserUk (Patriarch) on Apr 14, 2008 at 07:37 UTC
MimisIVI, if you mean vec, Tachyon-II pointed you at vec 5 days ago in Re^6: Compress positive integers. It's right there in the last line. Maybe you missed it? Anyhow, here is Elias Gamma coded in pure Perl:
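The code as posted isn't reproduced here, so the following is a minimal illustrative sketch of Elias Gamma in pure Perl, not the original: each positive integer N is written as floor(log2 N) zero bits followed by N in binary, and the whole bit string is packed into bytes. The function names and the count argument to the decoder are my own choices for the sketch.

```perl
#!/usr/bin/perl
# Illustrative pure-Perl Elias Gamma sketch (not the code originally posted).
use strict;
use warnings;

sub gamma_encode {
    my $bits = '';
    for my $n (@_) {
        die "Elias Gamma needs positive integers" if $n < 1;
        my $b = sprintf '%b', $n;              # N in binary, MSB first
        $bits .= '0' x ( length($b) - 1 ) . $b; # unary length prefix + binary
    }
    return $bits;
}

sub gamma_decode {
    my ( $bits, $count ) = @_;   # $count avoids misreading byte-padding zeros
    my @out;
    my $pos = 0;
    while ( @out < $count ) {
        my $zeros = 0;
        $zeros++, $pos++ while substr( $bits, $pos, 1 ) eq '0';
        push @out, oct( '0b' . substr( $bits, $pos, $zeros + 1 ) );
        $pos += $zeros + 1;
    }
    return @out;
}

my @ids    = ( 1, 5, 17, 300 );
my $packed = pack 'B*', gamma_encode(@ids);    # bit string -> bytes
my @back   = gamma_decode( unpack( 'B*', $packed ), scalar @ids );
print "@back\n";                                # 1 5 17 300
```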
And for reference, here are the results of my benchmark with that included:
Notice that PP_Elias achieves identical compression to Elias in C (as you'd expect :), but it's twice as slow at packing, and nearly 10 times slower when unpacking.

If you want to continue using Elias and can't build the C version yourself, I could let you have it pre-built (for Win32), but I have to wonder why you would when W-BER is faster and achieves better compression?

Also, I wonder if you saw Re^8: Byte allign compression in Perl.. where I demonstrated that you can have the DB do the selection for you, using the schema I suggested way back when, in 0.312 of a second? For the record, I since found a small optimisation in the schema that reduced that by a factor of 10, to 31 milliseconds.

So the DB does the selection, sends you just the data you need to do your proximity calculations, and does it all faster than you could pack a single integer. Interested?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by MimisIVI (Acolyte) on Apr 14, 2008 at 14:13 UTC
For the SQL command that you propose, I want to ask which server is appropriate, because MySQL has no command for the intersection (I tried some inner joins, but the performance was very, very slow for a 1GB dataset: 250,000 pages, average document length 602 words, 907,806 unique words)... About your code for the Elias technique, I have to say that it is 3 times faster than mine (thanks one more time!!!)... The only reason I want to use compression in my index is performance. That was my thinking until now, but it seems I wasn't right :(, since with the vec function I can decode 3,000,000 doc ids in 2 seconds, and 10 million in 6 seconds!!! as the below code shows..
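The benchmark code referred to isn't shown here; this is a sketch of the kind of vec-based test being described (the count and slot layout are illustrative assumptions): doc ids are written into a string in fixed 32-bit slots with vec, then read back out while timing the decode.

```perl
#!/usr/bin/perl
# Sketch of a vec decode benchmark (illustrative; not the original code).
use strict;
use warnings;
use Time::HiRes qw( time );

my $n      = 100_000;                            # scale to taste
my $vector = '';
vec( $vector, $_, 32 ) = $_ for 0 .. $n - 1;     # pack doc ids, 32 bits each

my $start = time;
my @ids;
push @ids, vec( $vector, $_, 32 ) for 0 .. $n - 1;   # decode them back
printf "decoded %d ids in %.3f seconds\n", scalar @ids, time - $start;
```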
In the above code I used 4 bytes for each doc id. I also tried a vector where I save the same number of doc ids but with only 1 byte for each (I saved only small numbers), and the time was exactly the same. I can't understand why. Does anyone?
by BrowserUk (Patriarch) on Apr 14, 2008 at 15:40 UTC
> ..since with the vec function i can decode 3000000 doc ids in 2 seconds and 10 milion in 6 secs!!!) as the below code shows..

First off, using vec to pack 32-bit (i.e. byte-, word- and dword-aligned) numbers is giving you a false impression of its performance. It's when you start crossing those boundaries that the performance falls off sharply. If you only want to pack 32-bit values on dword boundaries, pack 'V' (or 'N' if you're on a big-endian machine) is faster:
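The comparison code isn't shown in this thread; a sketch of what such a benchmark might look like, using the core Benchmark module (sizes are illustrative), times 32-bit reads via vec against a single unpack 'V*':

```perl
#!/usr/bin/perl
# Sketch: vec in 32-bit slots vs unpack 'V*' (illustrative benchmark).
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @ids    = 1 .. 100_000;
my $packed = pack 'V*', @ids;    # little-endian 32-bit; 'N' on big-endian

cmpthese( -1, {
    # Note: vec reads its 32-bit slots big-endian, so the values differ
    # from @ids here; this only times the per-element access.
    vec    => sub { my @out; push @out, vec( $packed, $_, 32 ) for 0 .. $#ids },
    unpack => sub { my @out = unpack 'V*', $packed },
} );
```

A single unpack 'V*' call also avoids one Perl-level loop entirely, which is where much of its speed advantage comes from.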
But neither gives you the compression you seek.

> About your code for the Elias technique i have to say that it is 3 times faster than mine

But it is still slower than my $packed = pack 'w*', @numbers; and achieves far less compression. With pack 'w', (BER) compression is built in, and it gives the best combination of compression and speed.

> For the SQL command that you propose i want to ask you for which server is appropriate because on MySQL there is no command for the intersection

I gave up on MySQL a long time ago because of its limitations. It has improved markedly in recent versions, allowing subselects in places where it never used to, and adding stored procedures and such, but I still prefer Postgres. In particular, the pgAdmin III tool is excellent for tuning your queries.

I'll try building a DB to match those numbers and let you know how the performance pans out. But even if it were 50 times slower than with (5000/554/15000), which it won't be, it would still be 100 times faster than having to decompress 25 times as much data as you need and then select the 4% you do need in Perl. I'll let you know.
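The one-liner above is the whole of the BER route; for completeness, a self-contained round trip (the sample numbers are illustrative) showing that small values cost 1 byte each and values grow by one byte per 7 bits:

```perl
#!/usr/bin/perl
# pack 'w' round trip: BER variable-length integers, built into Perl.
use strict;
use warnings;

my @numbers  = ( 3, 17, 300, 1_000_000 );
my $packed   = pack 'w*', @numbers;      # 1 + 1 + 2 + 3 bytes
my @restored = unpack 'w*', $packed;

print length($packed), " bytes: @restored\n";   # 7 bytes: 3 17 300 1000000
```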
by MimisIVI (Acolyte) on Apr 14, 2008 at 16:57 UTC
by BrowserUk (Patriarch) on Apr 14, 2008 at 17:36 UTC