Re^8: Do you really want to use an array there?

Ok. Here you go:

#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];


our $packedV = pack 'V*', 1 .. 1e6;
our $packedW = pack 'w*', 1 .. 1e6;
our $packedVec = '';  vec( $packedVec, $_, 32 ) = $_ for 0 .. 1e6 - 1;

cmpthese -5, {
    unpackV   => q[ my @nums = unpack 'V*', $packedV; ],
    unpackW   => q[ my @nums = unpack 'w*', $packedW; ],
    unVec     => q[ 
        my @nums; push @nums, vec( $packedVec, $_, 32 ) for 0 .. 1e6 - 1
    ],
};

print "$_: ", length( do{ no strict; ${$_} } ) 
    for qw[ packedV packedW packedVec ];

__END__
C:\test>junk0
          Rate   unVec unpackV unpackW
unVec   1.87/s      --    -29%    -31%
unpackV 2.64/s     41%      --     -2%
unpackW 2.70/s     44%      2%      --

packedV:   4000000
packedW:   2983490
packedVec: 4000000

unpack 'w' is 44% faster than vec and compresses the data to about 75% of its 32-bit size to boot, which means less time to transfer from the DB.
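The 2983490 figure follows from how 'w' (BER compressed integer) stores 7 payload bits per byte. A minimal sketch that checks this; the helper name ber_len is made up for illustration and is not part of the benchmark:

```perl
use strict;
use warnings;

# Bytes pack 'w' needs for a non-negative integer: 7 payload bits per byte.
# ber_len is a hypothetical helper, not something from the benchmark above.
sub ber_len {
    my ($n) = @_;
    my $len = 1;
    $len++ while $n >>= 7;
    return $len;
}

# Sum the predicted sizes for 1 .. 1e6 and compare with the real packed length.
my $predicted = 0;
$predicted += ber_len($_) for 1 .. 1_000_000;
my $actual = length pack 'w*', 1 .. 1_000_000;

printf "predicted %d, actual %d, %.1f%% of the 32-bit size\n",
    $predicted, $actual, 100 * $actual / 4_000_000;
```

Values up to 127 take one byte, up to 16383 two bytes, and everything else in this range three bytes, so the saving depends on how small the numbers are.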

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."


Replies are listed 'Best First'.
Re^9: Do you really want to use an array there? by MimisIVI (Acolyte) on Apr 14, 2008 at 18:37 UTC
Don't forget that the schema is like this:

TERM1 -> Docid1PositionsDocid2Positions......

so you have to separate the values, as tachyon-II said, by using the MSB as a flag, or in some other way if you like.

tachyon-II's code:

########################### PACK with W*
use Devel::Size qw(size);    # needed for the size() call below

my $MSB   = 1 << 31;
my $tests = 250_000;
my $DEBUG = 1;
print "Doing $tests tests\n";

my $pack = time();
my $str  = '';
for my $doc_id ( 1 .. $tests ) {
    #printf "Doc id: %d\n", $doc_id if $DEBUG;
    $str .= pack "w", $doc_id + $MSB;    # flag doc ids by adding the MSB
    for my $pos ( 0 .. 2 ) {
        $str .= pack "w", $pos;
    }
}
printf "pack time %d\n", time() - $pack;
printf "length %d\n", length $str;
print "Pack w's size: " . size($str) . " bytes\n";

my $unpack = time();
my $dat    = {};
my $doc_id = undef;
# 'w*' so that every value is unpacked; a bare 'w' returns only the first one
for my $int ( unpack "w*", $str ) {
    if ( $int > $MSB ) {
        $doc_id = $int - $MSB;
        #printf "\nDoc id: %d\t", $doc_id if $DEBUG;
    }
    else {
        push @{ $dat->{$doc_id} }, $int;
        #print "$int\t" if $DEBUG;
    }
}
printf "\n\nunpack time %d\n", time() - $unpack;

In a while you will have my own benchmarks too...
Re^10: Do you really want to use an array there? by MimisIVI (Acolyte) on Apr 14, 2008 at 20:43 UTC
Here is what I've done. I changed the schema by adding the TF (term frequency) for each term in each document, like below:

Term1->DocId1TFposDocId2TFposDocId2TFpos......

I added the TF to distinguish the doc ids from the positions. I suppose there is another way to distinguish the docs and the positions without adding more info, but so far I haven't found it.

Here I save 500000 documents in the bit string, keeping the TF and 4 positions for each doc:

use strict;
use Devel::Size qw(size);

my $df = 500000;
my $tf = 3;
my $wektor = '';
my $packed = '';
my $nr = 0;
for ( 0 .. $df ) {
    vec( $wektor, $nr++, 32 ) = $_;     # DOC ID......
    vec( $wektor, $nr++, 32 ) = $tf;    # TF......
    for ( 0 .. $tf ) {
        vec( $wektor, $nr++, 32 ) = $_ + 10;    # POSITIONS
    }
}
print "Vector's size: " . size($wektor) . " bytes\n";
#print $nr,"\n";

###################### UNPACK VECTOR2.....
my %vec;
my $docID = 0;
my $tf    = 0;
my $index = 0;
my $Aa    = time();
for ( 0 .. $df ) {
    $docID = vec( $wektor, $index++, 32 );
    $tf    = vec( $wektor, $index++, 32 );
    $vec{$docID} = $tf;
    # print "Doc id: $docID\ttf: $tf\n";
    for ( 0 .. $tf ) {
        # print "\t\tpositions: ",vec ($wektor, $index++, 32),"\n";
        vec( $wektor, $index++, 32 );    # read and discard the position
    }
}
print "unpack vector in \t", time() - $Aa, " secs...(oh Yeah!!!)\n";

__END__
Vector's size: 12000052 bytes
unpack vector in 4 secs...(oh Yeah!!!)

As you can see from the code, I save only the doc id and the TF in a hash, without saving the positions. I am still trying to find the appropriate structure to keep all this info.

That's all for now. I hope you find something faster....
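One possible structure for keeping the positions as well is a hash of array references, with the TF acting as a length prefix. A minimal sketch using pack 'w' rather than vec; the sample data and the names %postings and %decoded are made up for illustration, and the TF here is the exact number of positions (the code above stores $tf + 1 positions per document):

```perl
use strict;
use warnings;

# Hypothetical sample data: doc ID => list of term positions.
my %postings = ( 1 => [ 10, 11, 12, 13 ], 2 => [7], 5 => [ 3, 40 ] );

# Encode each document as: doc ID, TF (number of positions), then the positions.
my $packed = '';
for my $doc_id ( sort { $a <=> $b } keys %postings ) {
    my @pos = @{ $postings{$doc_id} };
    $packed .= pack 'w*', $doc_id, scalar @pos, @pos;
}

# Decode: unpack once into a flat list, then let the TF say how many
# values after it belong to the current document.
my @flat = unpack 'w*', $packed;
my %decoded;
my $i = 0;
while ( $i < @flat ) {
    my ( $doc_id, $tf ) = @flat[ $i, $i + 1 ];
    $decoded{$doc_id} = [ @flat[ $i + 2 .. $i + 1 + $tf ] ];
    $i += 2 + $tf;
}

print "$_: @{ $decoded{$_} }\n" for sort { $a <=> $b } keys %decoded;
```

Because the TF says where each record ends, no flag bit is needed on the doc ids. Whether this is fast enough on real data is untested here.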
Re^11: Do you really want to use an array there? by MimisIVI (Acolyte) on Apr 16, 2008 at 20:12 UTC
Finally, the winner is the vec function, but it is more expensive on space than pack 'w'...

use strict;
use Benchmark qw( timethese cmpthese );

my $df = 1000000;
my $tf = 3;
our $packedW   = '';
our $packedVec = '';
my $nr = 0;
foreach ( 0 .. $df ) {
    $packedW .= pack 'w', $_;
    $packedW .= pack 'w', $tf;
    vec( $packedVec, $nr++, 32 ) = $_;     # DOC ID......
    vec( $packedVec, $nr++, 32 ) = $tf;    # TF......
}

cmpthese 10, {
    unpackW => q[
        my $odd = 0; my %pack; my $doc = 0;
        foreach ( unpack 'w*', $packedW ) {
            if ( $odd % 2 != 0 ) {
                $pack{$doc} = $_; $odd++; next;
            }
            $doc = $_;
            $odd++;
        }
    ],
    unVec => q[
        my $docID = 0; my $tf = 0; my $index = 0; my %vec;
        foreach ( 0 .. 1000000 ) {
            $docID = vec( $packedVec, $index++, 32 );
            $tf    = vec( $packedVec, $index++, 32 );
            $vec{$docID} = $tf;
        }
    ],
};

print "$_: ", length( do{ no strict; ${$_} } )
    for qw[ packedVec packedW ];

__END__
        s/iter unpackW unVecNoGaP
unpackW   21.7      --       -60%
unVec     8.69    149%         --

packedVec: 8000008
packedW: 3983492

If you have any suggestion for how to efficiently distinguish the two different numbers (docID and TF) with the pack function, that would be great.

Thanks for your help...
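One thing that might be worth trying for the (docID, TF) case: the list returned by unpack 'w*' alternates docID, TF, docID, TF, so it can be assigned straight to a hash without the odd/even bookkeeping. A minimal sketch on a small made-up sample (%tf_for is a hypothetical name); it has not been benchmarked against the vec version:

```perl
use strict;
use warnings;

my $tf      = 3;     # same constant TF as in the benchmark
my $packedW = '';
$packedW .= pack 'w*', $_, $tf for 0 .. 9;    # small hypothetical sample

# The unpacked list alternates docID, TF, docID, TF, ... so a plain
# list-to-hash assignment pairs each docID with its TF.
my %tf_for = unpack 'w*', $packedW;

print "doc $_ => tf $tf_for{$_}\n" for sort { $a <=> $b } keys %tf_for;
```

This only works while every record is exactly two numbers; once positions are stored too, the pairs no longer line up and a cursor over the flat list is needed instead.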