in reply to Re: Byte allign compression in Perl..
in thread Byte allign compression in Perl..

Hi Tachyon...

Your code works perfectly and it is exactly what I want to do. The only thing that remains is to run some benchmarks to see if it is fast.

Now, to be sure I understand what you did:

You create a posting list for one term and pack it as below:
For each doc id you pack an unsigned integer, putting the doc flag at the beginning, and the same for the positions with the corresponding flag.

So the structure is like this:
Posting List: FLAGdocid1 FLAGPos1..PosN FLAGdocId2 FLAGPos1..PosN ...etc.

Please correct me if I am wrong...
Thanks so much!!!

I will let you know about the benchmarks very soon.

Re^3: Byte allign compression in Perl..
by tachyon-II (Chaplain) on Apr 11, 2008 at 17:34 UTC

    Not quite. In this example the data is a stream of fixed-width 4-byte ints. To serialise we do two things: 1) set the MSB on the doc_id int, and 2) ensure that the pos int does NOT have the MSB set. To deserialise we read each 4-byte int and check the MSB: if it is set we have a doc_id, if not it is a pos. The data looks like:

    MSBdoc_id1 pos1 pos2...posN MSBdoc_id2 pos1 pos2...posN MSBdoc_id3 blah
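As a minimal sketch of that flagging scheme (the variable names are mine, chosen to echo the benchmark code further down):

```perl
# Sketch of the MSB flagging described above (names assumed).
my $MSB       = 1 << 31;
my $SET_MSB   = pack "I", $MSB;        # OR mask: forces the MSB on
my $UNSET_MSB = pack "I", $MSB - 1;    # AND mask: forces the MSB off

my $doc = (pack "I", 42) | $SET_MSB;   # serialise doc_id 42, MSB set
my $pos = (pack "I", 7)  & $UNSET_MSB; # serialise pos 7, MSB clear

# Deserialise: the MSB tells us which kind of int we have
my $int = unpack "I", $doc;
print $int > $MSB ? "doc_id " . ($int - $MSB) : "pos $int", "\n";
# prints "doc_id 42"
```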

    You can make this work with gamma encoding, but as noted, gamma encoding uses more space than a 4-byte int once the numbers to be encoded exceed a 2-byte int. As the blurb says, it is good for encoding small ints only. The saving relates directly to how small the ints are; if they are really small it is worth doing.
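That break-even point is easy to check on the back of an envelope. This is my own sketch, assuming the standard Elias gamma cost of 2*floor(log2 n)+1 bits per value:

```perl
# Elias gamma codes n in 2*bitlen(n) - 1 bits, where bitlen(n) is the
# number of bits in n's binary form. A fixed "I" int is always 32 bits,
# so gamma only wins while the cost stays under 32 bits.
sub gamma_bits {
    my $n = shift;
    return 2 * length( sprintf "%b", $n ) - 1;
}

print gamma_bits(100), "\n";      # 13 bits - much better than 32
print gamma_bits(65_535), "\n";   # 31 bits - still (just) ahead
print gamma_bits(65_536), "\n";   # 33 bits - now worse than 4 bytes
```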

      It seems that it is very slow to decode the posting list...

      I created a posting list with 30000 doc ids and only 2 word ids for each doc, and it took 35 secs to decode it.
      ##### PACK THE POSTING LIST...
      # ($MSB, $SET_MSB and $UNSET_MSB defined as in your code;
      #  the start-of-pack timer was missing from my paste)
      my $Aa = time();
      my $str = '';
      for my $doc_id ( 0 .. 30000 ) {
          print "Doc id:\t", $doc_id, "\n";
          $str .= (pack "I", $doc_id) | $SET_MSB;
          for my $pos ( 0 .. 2 ) {
              $str .= (pack "I", $pos) & $UNSET_MSB;
          }
      }
      print "pack time \t", time() - $Aa, "\n";

      # UNPACK THE POSTING LIST
      my $Aaa = time();
      my $dat = {};
      my $doc_id = undef;
      for ( my $i = 0; $i < length $str; $i += 4 ) {
          my $int = unpack "I", (substr $str, $i);
          if ( $int > $MSB ) {
              $doc_id = unpack "I", ((pack "I", $int) & $UNSET_MSB);
          }
          else {
              push @{ $dat->{$doc_id} }, $int;
          }
      }
      print "unpack time \t", time() - $Aaa, "\n";

      pack time 2
      unpack time 35


      Did I do something wrong??

        Not really. Did I claim it was optimised? You wanted to use substr, so I put it in. It works a lot better without it, unpacking the ints straight into a list to iterate over rather than repeatedly calling substr.

        BTW, print is very slow, so never benchmark with debugging prints active, as your results will be meaningless. Try this (unpacking is now 4 times as fast as packing). Uncomment the commented-out prints (and $DEBUG) if you want to make sure it is still working. print "blah" if $DEBUG works fine for testing, but that is a helluva lot of ifs to put into your benchmark if you don't comment out the prints.

        my $MSB       = 1 << 31;
        my $SET_MSB   = pack "I", $MSB;
        my $UNSET_MSB = pack "I", $MSB - 1;
        my $tests     = 250_000;
        # my $DEBUG = 1;

        print "Doing $tests tests\n";

        my $start = time();
        my $str = '';
        for my $doc_id ( 1 .. $tests ) {
            #printf "Doc id: %d\n", $doc_id if $DEBUG;
            $str .= (pack "I", $doc_id) | $SET_MSB;
            for my $pos ( 0 .. 2 ) {
                $str .= (pack "I", $pos) & $UNSET_MSB;
            }
        }
        printf "pack time %d\n", time() - $start;
        printf "length %d\n", length $str;

        my $unpack = time();
        my $dat = {};
        my $doc_id = undef;
        for my $int ( unpack "I*", $str ) {
            if ( $int > $MSB ) {
                $doc_id = unpack "I", ((pack "I", $int) & $UNSET_MSB);
                #printf "\nDoc id: %d\t", $doc_id if $DEBUG;
            }
            else {
                push @{ $dat->{$doc_id} }, $int;
                #print "$int\t" if $DEBUG;
            }
        }
        printf "unpack time %d\n", time() - $unpack;

        __DATA__
        C:\>serialize.pl
        Doing 250000 tests
        pack time 4
        length 4000000
        unpack time 1

        C:\>

        Using Perl's native integer compression, pack template "w" (BER), we can also compress without hurting our speed much. This gives a compression that depends on just how many small ints follow the doc_id. With this compression a ~4-byte doc_id range takes 5 bytes, but a pos of < 128 takes 1 byte, < 128^2 (16,384) takes 2 bytes, < 128^3 (2,097,152) takes 3 bytes, etc.
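Those byte counts are easy to verify with pack "w" itself:

```perl
# BER ("w") stores 7 payload bits per byte, so the encoded length
# steps up at each power of 128.
print length( pack "w", 127 ),       "\n";  # 1 byte
print length( pack "w", 128 ),       "\n";  # 2 bytes
print length( pack "w", 16_383 ),    "\n";  # 2 bytes (128^2 - 1)
print length( pack "w", 16_384 ),    "\n";  # 3 bytes
print length( pack "w", 2**31 + 1 ), "\n";  # 5 bytes (flagged doc_id range)
```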

        In this scheme we add a number (still effectively the MSB) to the doc_id, so any number we get back over this threshold is a doc_id and any number under it is a pos. We let Perl (in C) handle the tricky details. With your dataset the compression is 50%, but it does much better than that if you have a lot of small pos numbers per doc_id, as they take very little space. For example, 1 doc_id and 11 pos ints < 128 take the same space as 1 doc_id and 3 pos ints < 2^31 took in the old scheme.

        Please note this caveat. We do no error checking when serialising, as it is expensive, so if you have a pos that is > MSB it will be interpreted as a doc_id. Also, doc_id = 0 is not supported, due to the doc_id > MSB test. And because integers are 4 bytes, if we have a doc_id > 2^31, adding 2^31 to it hits the 2^32 => 0 overflow wrap (2^32 is 1."0"x32 in binary), so a doc_id > MSB will be misinterpreted as a pos. This is also not checked, due to 1) the expense and 2) the fact that this code can't deal with it (it might work if you compile perl with 64-bit int support, but I can't be bothered to check).
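The doc_id = 0 edge case is easy to demonstrate. This is a sketch of the failure mode, not a fix:

```perl
my $MSB = 1 << 31;

# doc_id 0 serialises to exactly $MSB, which fails the strict > test
# on the way back out and is therefore misread as a pos.
my $int = unpack "w", pack "w", 0 + $MSB;
print $int > $MSB ? "doc_id" : "pos (misread!)", "\n";
```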

        my $MSB   = 1 << 31;
        my $tests = 250_000;
        my $DEBUG = 1;

        print "Doing $tests tests\n";

        my $pack = time();
        my $str = '';
        for my $doc_id ( 1 .. $tests ) {
            #printf "Doc id: %d\n", $doc_id if $DEBUG;
            $str .= pack "w", $doc_id + $MSB;
            for my $pos ( 0 .. 2 ) {
                $str .= pack "w", $pos;
            }
        }
        printf "pack time %d\n", time() - $pack;
        printf "length %d\n", length $str;

        my $unpack = time();
        my $dat = {};
        my $doc_id = undef;
        for my $int ( unpack "w*", $str ) {
            if ( $int > $MSB ) {
                $doc_id = $int - $MSB;
                #printf "\nDoc id: %d\t", $doc_id if $DEBUG;
            }
            else {
                push @{ $dat->{$doc_id} }, $int;
                #print "$int\t" if $DEBUG;
            }
        }
        printf "\n\nunpack time %d\n", time() - $unpack;

        __DATA__
        C:\>serialize.pl
        Doing 250000 tests
        pack time 2
        length 2000000
        unpack time 2

        C:\>