in reply to Byte allign compression in Perl..

For background see Re^9: Compress positive integers and the thread above which suggests an algorithm and the thread above which details the task.

Here is an implementation of the MSB serialisation strategy using fixed width 4 byte ints for simplicity.This task really asks to be implemented in C but can of course be done in Perl. << is a bitshift operator | is binary OR and & is binary AND.

use strict; use Data::Dumper; my $data = { 1 => [ 1..10 ], 2 => [ 11..20 ], 3 => [ 21..30 ], }; my $MSB = 1<<31; print "$MSB is MSB in decimal\n"; my $SET_MSB = pack "I",$MSB; printf "%s SET MASK in binary\n", (unpack "b32", $SET_MSB); my $UNSET_MSB = pack "I",$MSB-1; printf "%s UNSET MASK in binary\n", (unpack "b32", $UNSET_MSB); my $res = serialise($data); my $dat = unserialise($res); print Data::Dumper::Dumper($dat); sub serialise { my $data = shift; my $str = ''; for my $doc_id( keys %$data ) { $str .= (pack "I", $doc_id) | $SET_MSB; for my $pos ( @{$data->{$doc_id}} ) { $str .= (pack "I", $pos) & $UNSET_MSB; } } return $str; } sub unserialise { my $data = shift; my $dat = {}; my $doc_id = undef; for( my $i=0;$i<length $data; $i+=4) { my $int = unpack "I", (substr $data,$i); if ( $int > $MSB ) { $doc_id = unpack "I", ((pack "I", $int) & $UNSET_MSB); } else { push @{$dat->{$doc_id}}, $int; } } return $dat; }

Replies are listed 'Best First'.
Re^2: Byte allign compression in Perl..
by MimisIVI (Acolyte) on Apr 11, 2008 at 17:18 UTC
    Hi Tachyon...

    Your code works perfect and it is what exactly i want to do..The only that remains is to make some benchmarks to see if it is fast..

    Now to be sure what you did..

    You create a posting list for one Term and you pack it like below..
    For each Doc id you pack in an unsigned Integer and you put the doc flag at the begining and the same for the positions with the coresponding flag..

    So the structure is like that :
    Posting List: FLAGdocid1FLAGPos1..PosNFLAGdocId2FLAGPos1..PosN....e.t.c.

    Plz correct me if i am wrong...
    Thanks so much!!!

    I will info you very soon with the benchamrks..

      Not quite. In this example the data is a stream of fixed witdth 4 byte ints. To serialise we do 2 things: 1) set the MSB on the doc_id int and 2)ensure that the pos int DOES NOT have the MSB set. To deserialise we read each 4 byte int, see if the MSB is set (if so we have a doc_id), or not set (if not set it is a pos). The data looks like:

      MSBdoc_id1 pos1 pos2...posN MSBdoc_id2 pos1 pos2...posN MSBdoc_id3 blah

      You can make this work with gamma encoding but as noted the gamma encoding compression algorithm uses more space than a 4 byte int if the numbers to be encoded exceed a 2 byte int. As it says in the blurb it is good for encoding small ints only. The saving directly relates to how small the ints are. If they are really small it is worth doing.

        It seems that is very slow to decode the Posting List...

        I create a posting list with 30000 doc ids and only 2 word ids for each doc and took 35 secs to decode the posting list..
        ##### PACK THE POSTING LIST... my $str = ''; for my $doc_id( 0 .. 30000 ) { print "Doc id:\t",$doc_id,"\n"; $str .= (pack "I", $doc_id) | $SET_MSB; for my $pos ( 0 .. 2 ) { $str .= (pack "I", $pos) & $UNSET_MSB; } } print "pack time \t",time()-$Aa,"\n"; # UNPACK THE POSTING LIST my $Aaa=time(); my $dat = {}; my $doc_id = undef; for( my $i=0;$i<length $str; $i+=4) { my $int = unpack "I", (substr $str,$i); if ( $int > $MSB ) { $doc_id = unpack "I", ((pack "I", $int) & $UNSET_MSB); } else { push @{$dat->{$doc_id}}, $int; } } print "unpack time \t",time()-$Aaa,"\n";
        pack time 2 unpack time 35


        Did i make something wrong??