in reply to Re^5: Do you really want to use an array there?
in thread Do you really want to use an array there?

First i want to ask you what do you mean with a false impression of it's performance???

I wasn't questioning your timing method. My point was that using vec on full 32-bit numbers is directly equivalent to using pack, but slower. Unpacking with vec is much slower:

C:\test>p1 $packed = pack 'V*', 1 .. 1e6;; cmpthese -5, { unpack => q[ my @nums = unpack 'V*', $packed ], unvec => q[ my @nums; push @nums, vec( $packed, $_, 32 ) for 1 .. + 1e6 ], };; Rate unvec unpack unvec 1.45/s -- -45% unpack 2.64/s 81% --

But even that belies the real problem. With Elias Gamma, you have to pack/unpack strings of bits that cross byte boundaries, and the only way to do that it to do them 1-bit at a time. That means calling vec for every bit in the string, rather than every 32-bits as you do in your test above. And once you start calling vec for every bit, things slow down dramatically. As you would expect as you are calling the built-in 32 times more frequently. By way of comparison, the equivalent of your test above, but done one bit at a time is:

use strict; my $wektor = ''; for my $num ( 0 .. 9_999_999 ) { my $binNum = pack 'V', $num; vec( $wektor, $num * 32 + $_, 1 ) = vec( $binNum, $_, 1 ) for 1 .. + 32; } print "Vector's size: " . length( $wektor ) . " bytes\n"; my @vec; my $Aa = time(); for my $num ( 0 .. 9_999_999 ) { my $binNum = ''; vec( $binNum, $_, 1 ) = vec( $wektor, $num * 32 + $_, 1 ) for 1 .. + 32; push @vec, unpack 'V', $binNum; } print "unpack vector in \t", time() - $Aa, " secs...(Oh dear!!!)\n"; __END__ C:\test>junk0 Vector's size: 40000001 bytes unpack vector in 362 secs...(Oh dear!!!)

But, you still seem to be completely ignoring pack 'w*', @nums, which gives better compression than Elias Gamma, and is faster than anything else to boot.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^7: Do you really want to use an array there?
by MimisIVI (Acolyte) on Apr 14, 2008 at 17:51 UTC
    Look plz in the above thread what i wrote...

    The only reason why i want to use compression in my index is for perfomance reasons..that was my thougths untill now but it seems that i wasnt right..:(..since with the vec function i can decode 3000000 doc ids in 2 seconds and 10 milion in 6 secs!!!)

    Lets forget the Elias code..and use only the vec , unpack V* and w* to benchmark..and finaly to found the winner...

      Ok. Here you go:

      #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $packedV = pack 'V*', 1 .. 1e6; our $packedW = pack 'w*', 1 .. 1e6; our $packedVec = ''; vec( $packedVec, $_, 32 ) = $_ for 0 .. 1e6 - 1; cmpthese -5, { unpackV => q[ my @nums = unpack 'V*', $packedV; ], unpackW => q[ my @nums = unpack 'w*', $packedW; ], unVec => q[ my @nums; push @nums, vec( $packedVec, $_, 32 ) for 0 .. 1e6 - + 1 ], }; print "$_: ", length( do{ no strict; ${$_} } ) for qw[ packedV packedW packedVec ]; __END__ C:\test>junk0 Rate unVec unpackV unpackW unVec 1.87/s -- -29% -31% unpackV 2.64/s 41% -- -2% unpackW 2.70/s 44% 2% -- packedV: 4000000 packedW: 2983490 packedVec: 4000000

      unpack 'w' is 44 % faster than vec and compresses the data to 75% to boot. Which means less time to transfer from the DB.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Dont forget that the schema is like this:


        TERM1 -> Docid1PositionsDocid2Positions......


        so you have to separate the values as tachyon-II said by use the MSB as flag or somehow else if you like...

        Tachyon-II code:
        ########################### PACK with W* my $MSB = 1 << 31; my $tests = 250_000; my $DEBUG =1; print "Doing $tests tests\n"; my $pack = time(); my $str = ''; for my $doc_id( 1 .. $tests ) { #printf "Doc id: %d\n", $doc_id if $DEBUG; $str .= pack "w", $doc_id+$MSB; for my $pos ( 0 .. 2 ) { $str .= pack "w", $pos; } } printf "pack time %d\n", time()-$pack; printf "length %d\n", length $str; print "PAck w*'s size: " . size( $str) . " bytes\n"; my $unpack = time(); my $dat = {}; my $doc_id = undef; for my $int (unpack "w*", $str) { if ( $int > $MSB ) { $doc_id = $int - $MSB; #printf "\nDoc id: %d\t", $doc_id if $DEBUG; } else { push @{$dat->{$doc_id}}, $int; #print "$int\t" if $DEBUG; } } printf "\n\nunpack time %d\n",time()-$unpack;

        In while you will have my own benchmarks too...