Re^4: Do you really want to use an array there?

by BrowserUk (Patriarch)
on Apr 14, 2008 at 15:40 UTC ( [id://680316] )


in reply to Re^3: Do you really want to use an array there?
in thread Do you really want to use an array there?

..since with the vec function I can decode 3,000,000 doc ids in 2 seconds and 10 million in 6 secs!!!) as the code below shows..

First off, using vec to pack 32-bit (i.e. byte-, word- and dword-aligned) numbers is giving you a false impression of its performance. It's when you start crossing those boundaries that the performance falls off sharply. If you just want to pack 32 bits on dword boundaries, pack 'V' (or 'N' if you're on a big-endian machine) is faster:

C:\test>p1
cmpthese -3, {
    pack_32bit => q[ my $packed = pack 'V*', 1 .. 1e6 ],
    vec_32bit  => q[ my $packed = ''; vec( $packed, $_, 32 ) = $_ for 1 .. 1e6 ],
};;
             s/iter  vec_32bit pack_32bit
vec_32bit      5.58         --       -12%
pack_32bit     4.94        13%         --

But neither gives you the compression you seek.

About your code for the Elias technique, I have to say that it is 3 times faster than mine

But it is still slower than my $packed = pack 'w*', @numbers; and achieves far less compression. pack 'w' (BER) compression is built in, and gives the best compression and speed.
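
A minimal sketch of a 'w' (BER) round trip, using a small, purely illustrative list of doc ids, to show the packing and unpacking calls side by side and the size relative to fixed 32-bit packing:

use strict;
use warnings;

# Illustrative doc ids only; real posting lists are far larger.
my @numbers = ( 1, 127, 128, 300_000, 3_000_000 );

# BER (variable-length) packing: small ids take 1 byte, larger ones more.
my $ber   = pack 'w*', @numbers;
# Fixed 32-bit packing for comparison: always 4 bytes per id.
my $fixed = pack 'V*', @numbers;

printf "BER packed:   %d bytes\n", length $ber;
printf "Fixed packed: %d bytes\n", length $fixed;

# Round trip: unpack 'w*' restores the original list.
my @back = unpack 'w*', $ber;
print "Round trip ok\n" if "@back" eq "@numbers";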

For the SQL command that you propose, I want to ask which server is appropriate, because on MySQL there is no command for the intersection (I tried some inner joins but the performance was very, very, very slow for a 1GB dataset (250000 pages, average document length: 602 words, number of unique words: 907806))...

I gave up on MySQL a long time ago because of its limitations. It has improved markedly in recent versions, allowing subselects in places where it never used to, and adding stored procedures and the like, but I still prefer Postgres. In particular, the pgAdmin III tool is excellent for tuning your queries.
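
The SQL proposed upthread isn't reproduced here, but as a rough sketch of what a posting-list intersection looks like in Postgres (the postings table, its columns and the word ids are hypothetical placeholders, not taken from this thread):

use strict;
use warnings;
use DBI;

# Hypothetical schema: postings( word_id INTEGER, doc_id INTEGER ),
# one row per occurrence of a word in a document.
my $dbh = DBI->connect( 'dbi:Pg:dbname=index', 'user', 'pass',
                        { RaiseError => 1, AutoCommit => 1 } );

# Doc ids containing *both* terms: add one INTERSECT per extra term.
my $sql = q{
    SELECT doc_id FROM postings WHERE word_id = ?
    INTERSECT
    SELECT doc_id FROM postings WHERE word_id = ?
};

my $docs = $dbh->selectcol_arrayref( $sql, undef, 42, 99 );
print scalar( @$docs ), " matching documents\n";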

I'll try building a DB to match those numbers and let you know how the performance pans out, but even if it were 50 times slower than with (5000/554/15000), which it won't be, it would still be 100 times faster than decompressing 25 times as much data as you need and then selecting the 4% you do need in Perl. I'll let you know.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^5: Do you really want to use an array there?
by MimisIVI (Acolyte) on Apr 14, 2008 at 16:57 UTC
    First I want to ask you what you mean by a false impression of its performance??? I didn't get it.. I count the time the normal way (time()).. where is the false performance???

    In your above code you give benchmarks for the packing... what about unpacking?? That is the most important part for me.. the query response time.. that's what I want to improve...

    Can you please give some benchmarks for the unpacking process? I am still not able to use these G....t functions (pack/unpack) appropriately...



    Thanks!!!!
      First I want to ask you what you mean by a false impression of its performance???

      I wasn't questioning your timing method. My point was that using vec on full 32-bit numbers is directly equivalent to using pack, but slower. Unpacking with vec is much slower:

      C:\test>p1
      $packed = pack 'V*', 1 .. 1e6;;
      cmpthese -5, {
          unpack => q[ my @nums = unpack 'V*', $packed ],
          unvec  => q[ my @nums; push @nums, vec( $packed, $_, 32 ) for 1 .. 1e6 ],
      };;
                Rate  unvec unpack
      unvec   1.45/s     --   -45%
      unpack  2.64/s    81%     --

      But even that belies the real problem. With Elias Gamma, you have to pack/unpack strings of bits that cross byte boundaries, and the only way to do that is to do them 1 bit at a time. That means calling vec for every bit in the string, rather than once per 32 bits as you do in your test above. And once you start calling vec for every bit, things slow down dramatically, as you would expect when calling the built-in 32 times more frequently. By way of comparison, the equivalent of your test above, but done one bit at a time, is:

      use strict;

      # Pack 10 million numbers into the bit vector one bit at a time.
      my $wektor = '';
      for my $num ( 0 .. 9_999_999 ) {
          my $binNum = pack 'V', $num;
          vec( $wektor, $num * 32 + $_, 1 ) = vec( $binNum, $_, 1 ) for 1 .. 32;
      }
      print "Vector's size: " . length( $wektor ) . " bytes\n";

      # Read them back, again one bit at a time, and time it.
      my @vec;
      my $Aa = time();
      for my $num ( 0 .. 9_999_999 ) {
          my $binNum = '';
          vec( $binNum, $_, 1 ) = vec( $wektor, $num * 32 + $_, 1 ) for 1 .. 32;
          push @vec, unpack 'V', $binNum;
      }
      print "unpack vector in \t", time() - $Aa, " secs...(Oh dear!!!)\n";

      __END__

      C:\test>junk0
      Vector's size: 40000001 bytes
      unpack vector in 362 secs...(Oh dear!!!)

      But, you still seem to be completely ignoring pack 'w*', @nums, which gives better compression than Elias Gamma, and is faster than anything else to boot.
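
      For completeness, a minimal sketch of how the benchmark above could be extended to cover the unpacking side; the keys and counts are illustrative, and the timings will depend on your machine:

      use strict;
      use warnings;
      use Benchmark qw( cmpthese );

      # Build both packed forms once, outside the timed code.
      my $ber   = pack 'w*', 1 .. 1e6;   # BER, variable-length
      my $fixed = pack 'V*', 1 .. 1e6;   # fixed 32-bit little-endian

      cmpthese( -5, {
          unpack_w => sub { my @nums = unpack 'w*', $ber   },
          unpack_V => sub { my @nums = unpack 'V*', $fixed },
      } );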


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Please look at what I wrote in the thread above...

        The only reason why I want to use compression in my index is performance.. that was my thought until now, but it seems I wasn't right.. :( ..since with the vec function I can decode 3,000,000 doc ids in 2 seconds and 10 million in 6 secs!!!)

        Let's forget the Elias code and use only vec, unpack 'V*' and 'w*' for the benchmark... and finally find the winner...
