Re^3: Do you really want to use an array there?

by MimisIVI (Acolyte)
on Apr 14, 2008 at 14:13 UTC


in reply to Re^2: Do you really want to use an array there?
in thread Do you really want to use an array there?

No, my friend, I didn't miss Tachyon's comment about the vec function, but deprecated's code helped me understand what is going on with this function.

As for the SQL command that you propose, which server is it appropriate for? MySQL has no command for intersection. I tried an inner join, but the performance was very, very slow for a 1GB dataset (250000 pages; average document length: 602 words; number of unique words: 907806).

About your code for the Elias technique, I have to say that it is 3 times faster than mine (thanks one more time!). The only reason I want to use compression in my index is performance. That was my thinking until now, but it seems I wasn't right :( since with the vec function I can decode 3000000 doc ids in 2 seconds and 10 million in 6 seconds, as the code below shows.

    use strict;
    use Devel::Size qw(size);

    my $wektor = '';
    for ( 0 .. 10000000 ) {
        vec( $wektor, $_, 32 ) = $_;
    }
    print "Vector's size: " . size( $wektor ) . " bytes\n";

    my @vec;
    my $Aa = time();
    for ( 0 .. 10000000 ) {
        push @vec, vec( $wektor, $_, 32 );
    }
    print "unpack vector in \t", time() - $Aa, " secs...(oh Yeah!!!)\n";

    Size of vector is 40000032 bytes
    unpack vector in 6 secs...(oh Yeah!!!)

In the above code I used 4 bytes for each doc id. I also tried a vector holding the same number of doc ids but with only 1 byte per doc id (I saved only small numbers), and the time was exactly the same. I can't understand why. Does anyone?

Replies are listed 'Best First'.
Re^4: Do you really want to use an array there?
by BrowserUk (Patriarch) on Apr 14, 2008 at 15:40 UTC
    ..since with the vec function I can decode 3000000 doc ids in 2 seconds and 10 million in 6 seconds, as the code below shows.

    First off, using vec to pack 32-bit (i.e. byte-, word- and dword-aligned) numbers is giving you a false impression of its performance. It's when you start crossing those boundaries that the performance falls off sharply. If you just want to pack 32-bit values on dword boundaries, pack 'V' (or 'N' if you're on a big-endian machine) is faster:

        C:\test>p1
        cmpthese -3, {
            pack_32bit => q[ my $packed = pack 'V*', 1 .. 1e6 ],
            vec_32bit  => q[ my $packed = ''; vec( $packed, $_, 32 ) = $_ for 1 .. 1e6 ]
        };;
                   s/iter vec_32bit pack_32bit
        vec_32bit    5.58        --       -12%
        pack_32bit   4.94       13%         --

    But neither gives you the compression you seek.

    About your code for the Elias technique, I have to say that it is 3 times faster than mine

    But it is still slower than my $packed = pack 'w*', @numbers; and achieves far less compression. BER compression (pack 'w') is built in and gives the best compression and speed.
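As a rough illustration of the pack 'w' point above, here is a minimal sketch of a BER round trip next to a fixed 4-byte encoding. The sample doc ids are made up for the example and are not from this thread; the size difference will depend on how large your real ids are.

```perl
use strict;
use warnings;

# Hypothetical sample doc ids (an assumption, not data from the thread).
my @doc_ids = ( 1, 127, 128, 16_383, 16_384, 250_000 );

# 'w' stores each unsigned integer in base 128: 7 payload bits per byte,
# with the high bit set on every byte except the last one of each number.
my $ber   = pack 'w*', @doc_ids;
my $fixed = pack 'N*', @doc_ids;    # fixed 4 bytes per id, for comparison

# Decoding the whole list is a single unpack call.
my @decoded = unpack 'w*', $ber;

printf "BER: %d bytes, fixed 32-bit: %d bytes\n", length $ber, length $fixed;
print "round trip ok\n" if "@decoded" eq "@doc_ids";
```

Small ids take one byte each and only larger ones grow to two or three bytes, which is where the saving over a fixed-width layout comes from.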

    As for the SQL command that you propose, which server is it appropriate for? MySQL has no command for intersection. I tried an inner join, but the performance was very, very slow for a 1GB dataset (250000 pages; average document length: 602 words; number of unique words: 907806).

    I gave up on MySQL a long time ago because of its limitations. It has improved markedly in recent versions, allowing subselects in places where it never used to and adding stored procedures and the like, but I still prefer Postgres. In particular, the pgAdmin III tool is excellent for tuning your queries.

    I'll try building a DB to match those numbers and let you know how the performance pans out. But even if it were 50 times slower than with (5000/554/15000), which it won't be, it would still be 100 times faster than having to decompress 25 times as much data as you need and then selecting the 4% you do need in Perl. I'll let you know.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      First I want to ask what you mean by "a false impression of its performance". I didn't get it. I measure the time the normal way (time()). Where is the false performance?

      In your code above you give benchmarks for packing. What about unpacking? That is the most important thing for me: the query response time is what I want to improve.

      Could you please give some benchmarks for the unpacking process? I am still not able to use these G....t functions (pack/unpack) properly.



      Thanks!!!!
        First I want to ask what you mean by "a false impression of its performance".

        I wasn't questioning your timing method. My point was that using vec on full 32-bit numbers is directly equivalent to using pack, but slower. Unpacking with vec is much slower:

            C:\test>p1
            $packed = pack 'V*', 1 .. 1e6;;
            cmpthese -5, {
                unpack => q[ my @nums = unpack 'V*', $packed ],
                unvec  => q[ my @nums; push @nums, vec( $packed, $_, 32 ) for 1 .. 1e6 ],
            };;
                     Rate  unvec unpack
            unvec  1.45/s     --   -45%
            unpack 2.64/s    81%     --
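To see the equivalence between vec on whole 32-bit elements and pack, a small check like the following can help. It is only a sketch with made-up sample values. It relies on the documented behaviour that vec stores 16-bit and wider elements in big-endian order, so the matching pack template here is 'N' rather than the 'V' used in the timings.

```perl
use strict;
use warnings;

# Hypothetical sample values (an assumption, not data from the thread).
my @nums = ( 0, 1, 255, 65_536, 4_000_000_000 );

# Build the string one element at a time with vec ...
my $via_vec = '';
vec( $via_vec, $_, 32 ) = $nums[$_] for 0 .. $#nums;

# ... and in a single call with pack, using the big-endian 32-bit template.
my $via_pack = pack 'N*', @nums;

print "identical bytes\n" if $via_vec eq $via_pack;

# Because the bytes match, one unpack call replaces the per-element vec loop.
my @back = unpack 'N*', $via_vec;
print "@back\n";
```

The practical upshot is that data written with vec in 32-bit units can be read back with a single unpack rather than one vec call per element.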

        But even that hides the real problem. With Elias Gamma, you have to pack/unpack strings of bits that cross byte boundaries, and the only way to do that is to do them 1 bit at a time. That means calling vec for every bit in the string, rather than every 32 bits as you do in your test above. And once you start calling vec for every bit, things slow down dramatically, as you would expect, since you are calling the built-in 32 times more frequently. By way of comparison, the equivalent of your test above, but done one bit at a time, is:

            use strict;

            my $wektor = '';
            for my $num ( 0 .. 9_999_999 ) {
                my $binNum = pack 'V', $num;
                # Note: 1 .. 32 is as posted; 0 .. 31 would address exactly the 32 bits
                # of each number, and the extra byte in the size reported below comes from this.
                vec( $wektor, $num * 32 + $_, 1 ) = vec( $binNum, $_, 1 ) for 1 .. 32;
            }
            print "Vector's size: " . length( $wektor ) . " bytes\n";

            my @vec;
            my $Aa = time();
            for my $num ( 0 .. 9_999_999 ) {
                my $binNum = '';
                vec( $binNum, $_, 1 ) = vec( $wektor, $num * 32 + $_, 1 ) for 1 .. 32;
                push @vec, unpack 'V', $binNum;
            }
            print "unpack vector in \t", time() - $Aa, " secs...(Oh dear!!!)\n";
            __END__
            C:\test>junk0
            Vector's size: 40000001 bytes
            unpack vector in 362 secs...(Oh dear!!!)
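For reference, this is roughly what a bit-at-a-time Elias Gamma coder built on vec looks like. It is a minimal sketch, not the code discussed earlier in this thread: the helper names gamma_put and gamma_get and the sample ids are made up for illustration. Note that every single bit costs one vec call, in both directions.

```perl
use strict;
use warnings;

# Append the Elias gamma code of $n (n >= 1) to $$buf, starting at bit $$pos.
# gamma_put / gamma_get are hypothetical helper names used only in this sketch.
sub gamma_put {
    my ( $buf, $pos, $n ) = @_;
    my $bits = length( sprintf '%b', $n );          # significant bits in $n
    vec( $$buf, $$pos++, 1 ) = 0 for 2 .. $bits;    # $bits - 1 leading zeros
    for my $i ( reverse 0 .. $bits - 1 ) {          # then $n itself, MSB first
        vec( $$buf, $$pos++, 1 ) = ( $n >> $i ) & 1;
    }
}

# Read one gamma-coded number from $$buf at bit $$pos and advance $$pos.
sub gamma_get {
    my ( $buf, $pos ) = @_;
    my $zeros = 0;
    $zeros++ until vec( $$buf, $$pos++, 1 );        # count zeros up to the first 1
    my $n = 1;                                      # that 1 is the top bit of $n
    $n = ( $n << 1 ) | vec( $$buf, $$pos++, 1 ) for 1 .. $zeros;
    return $n;
}

my @ids = ( 1, 2, 5, 17, 1000 );                    # hypothetical sample values
my ( $buf, $pos ) = ( '', 0 );
gamma_put( \$buf, \$pos, $_ ) for @ids;
my $total_bits = $pos;

$pos = 0;
my @out;
push @out, gamma_get( \$buf, \$pos ) while $pos < $total_bits;
print "@out\n";
```

The decoder has to be told where the data ends (here via $total_bits), because trailing padding bits in the last byte are zeros and would otherwise look like the start of another code.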

        But you still seem to be completely ignoring pack 'w*', @nums, which gives better compression than Elias Gamma and is faster than anything else to boot.


