MimisIVI has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I recently implemented byte-aligned compression and I want your opinion on whether I used the most efficient approach...

Byte-aligned encode... (compress just 20 docs, for example)
(I don't compress anything else except doc IDs)
my $posting;
for my $docid ( 1 .. 20 ) {
    $posting .= ByteEncode($docid);    # ByteEncode() is my encoder (not shown)
}

# At the end I pack the whole bit string
my $compressed = pack "b*", $posting;  # I am not sure if it's the best format...

# Decode the compressed bit string
my $df     = 20;                       # number of compressed docs
my $bits   = 7;
my $decode = unpack "b*", $compressed;
my @w;
my $nr     = 0;
my $binary = '';
while (1) {
    # check the MSB; if it is 1 then it's the last byte for the current doc ID
    my $temp = substr( $decode, 0, $bits + 1, '' );
    last unless length $temp;          # ran out of bits
    if ( $temp =~ m/^1/ ) {            # it's the last one...
        substr( $temp, 0, 1, '' );
        $binary .= $temp;
        push @w, bin2dec($binary);
        $binary = '';
        $nr++;
        last if $nr == $df;
    }
    else {                             # remove the MSB, keep the rest, and go to the next byte
        substr( $temp, 0, 1, '' );
        $binary .= $temp;
    }
}
# @w now holds the decoded doc IDs

sub bin2dec { return oct "0b" . shift }    # helper used above

My question is: how can I read 1 byte (or n bits) at a time from the bit string, and how can I check the MSB in the packed format without unpacking it and working on it with substr?
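(For reference, a minimal sketch of one way to do this: read the packed string a byte at a time with vec() and test a flag bit with a mask. With the "b*" template the first bit of each 8-bit group lands in the low-order bit of its byte, so the mask is 0x01; with "B*" it would be the high bit, tested with 0x80.)

my $packed = pack "b*", "10000001" . "00000010";       # two example 8-bit groups
for my $i ( 0 .. length($packed) - 1 ) {
    my $byte = vec( $packed, $i, 8 );                  # i-th byte as an integer, no unpack "b*"
    print "byte $i: flag bit set\n" if $byte & 0x01;   # first bit of each "b*" group
}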

Another thing that I don't know is how to put delimiters into the bit string, because I want to save, for each doc ID, the corresponding positions of each term..

Like this: Term1 -> DocID1 Positions DocID2 Positions .. etc.

Thanks


Re: Byte allign compression in Perl..
by tachyon-II (Chaplain) on Apr 11, 2008 at 16:32 UTC
    For background see Re^9: Compress positive integers, the thread above it (which suggests an algorithm), and the thread above that (which details the task).

    Here is an implementation of the MSB serialisation strategy, using fixed-width 4-byte ints for simplicity. This task really asks to be implemented in C but can of course be done in Perl. << is the bitshift operator, | is binary OR, and & is binary AND.

    use strict;
    use Data::Dumper;

    my $data = {
        1 => [ 1..10 ],
        2 => [ 11..20 ],
        3 => [ 21..30 ],
    };

    my $MSB = 1 << 31;
    print "$MSB is MSB in decimal\n";
    my $SET_MSB = pack "I", $MSB;
    printf "%s SET MASK in binary\n", (unpack "b32", $SET_MSB);
    my $UNSET_MSB = pack "I", $MSB - 1;
    printf "%s UNSET MASK in binary\n", (unpack "b32", $UNSET_MSB);

    my $res = serialise($data);
    my $dat = unserialise($res);
    print Data::Dumper::Dumper($dat);

    sub serialise {
        my $data = shift;
        my $str  = '';
        for my $doc_id ( keys %$data ) {
            $str .= (pack "I", $doc_id) | $SET_MSB;      # flag doc_ids with the MSB
            for my $pos ( @{$data->{$doc_id}} ) {
                $str .= (pack "I", $pos) & $UNSET_MSB;   # positions never have the MSB
            }
        }
        return $str;
    }

    sub unserialise {
        my $data   = shift;
        my $dat    = {};
        my $doc_id = undef;
        for ( my $i = 0; $i < length $data; $i += 4 ) {
            my $int = unpack "I", (substr $data, $i);
            if ( $int > $MSB ) {
                $doc_id = unpack "I", ((pack "I", $int) & $UNSET_MSB);
            }
            else {
                push @{$dat->{$doc_id}}, $int;
            }
        }
        return $dat;
    }
      Hi Tachyon...

      Your code works perfectly and it is exactly what I want to do. The only thing that remains is to run some benchmarks to see if it is fast..

      Now, to be sure I understand what you did:

      You create a posting list for one term and you pack it as below:
      For each doc ID you pack it into an unsigned integer and put the doc flag at the beginning, and the same for the positions with the corresponding flag..

      So the structure is like this:
      Posting List: FLAGdocid1 FLAGPos1..PosN FLAGdocid2 FLAGPos1..PosN ... etc.

      Please correct me if I am wrong...
      Thanks so much!!!

      I will let you know very soon with the benchmarks..

        Not quite. In this example the data is a stream of fixed-width 4-byte ints. To serialise we do two things: 1) set the MSB on the doc_id int, and 2) ensure that the pos int DOES NOT have the MSB set. To deserialise we read each 4-byte int and see if the MSB is set (if so we have a doc_id) or not set (if not, it is a pos). The data looks like:

        MSBdoc_id1 pos1 pos2...posN MSBdoc_id2 pos1 pos2...posN MSBdoc_id3 blah

        You can make this work with gamma encoding, but as noted, the gamma encoding compression algorithm uses more space than a 4-byte int if the numbers to be encoded exceed a 2-byte int. As it says in the blurb, it is good for encoding small ints only. The saving relates directly to how small the ints are. If they are really small, it is worth doing.
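
        For reference, a minimal pure-Perl sketch of Elias gamma coding on a bit string (assuming positive integers only; packing the bit string into bytes with "B*" would happen elsewhere):

        sub gamma_encode {
            my $bits = '';
            for my $n (@_) {                           # each $n must be >= 1
                my $b = sprintf "%b", $n;              # binary, most significant bit first
                $bits .= '0' x (length($b) - 1) . $b;  # length-1 zeros, then the value
            }
            return $bits;
        }

        sub gamma_decode {
            my $bits = shift;
            my @out;
            my $i = 0;
            while ( $i < length $bits ) {
                my $zeros = 0;
                $zeros++, $i++ while substr( $bits, $i, 1 ) eq '0';   # count leading zeros
                push @out, oct "0b" . substr( $bits, $i, $zeros + 1 );
                $i += $zeros + 1;
            }
            return @out;
        }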

Re: Byte allign compression in Perl..
by BrowserUk (Patriarch) on Apr 11, 2008 at 19:07 UTC

    MimisIVI, I coded the Elias gamma algorithm (and posted a correction to the Wikipedia page) in C and compared its compression levels and performance with using split & join on the ASCII, and pack & unpack on the binary. Here are the results:

    c:\test>678848
    Run with 15000 unique words in 5000 documents (Ave: 554 words/doc)
    ASCII  uses 26235503 bytes
    Binary uses 16398152 bytes
    Elias  uses 22057989 bytes
    1 trial of Packing ascii    (163.453s total)
    1 trial of Unpacking ascii  (14.564s total)
    1 trial of Packing binary   (222.252s total)
    1 trial of Unpacking binary (7.203s total)
    1 trial of Packing Elias    (337.402s total)
    1 trial of Unpacking Elias  (14.141s total)

    As you can see, Elias gamma, coded in C, runs slower than ASCII (a lot slower for compression) and doesn't achieve as good compression as packing binary(*). And remember, Elias gamma should achieve better compression than the byte-aligned Büttcher algorithm.

    Now, my C implementation may not be the quickest possible, though I have done the obvious things to maximise its performance, but it is at least 5 times faster than anything I could code in Perl.

    There really is no way that either algorithm is going to have any major effect on the performance of your algorithm. In Perl or C, any performance gained in read & transmission times is completely negated by the time spent packing and unpacking. I cannot recommend more strongly the logic of using a better schema, and so only reading & transferring the data you need, rather than compressing and transferring 25x more data than you need. 4% vs 84%.

    I'm compressing the binary using a 16-bit template. This allows for documents up to 64k in length. Elias does allow you to safely compress larger numbers, whereas the code below would break. But as tachyon-II pointed out elsewhere, once you start using Elias gamma on numbers greater than a 16-bit int, the data gets larger, not smaller. There really is no mileage in using compression for this when redefining the schema can reduce your data transfers to 4%, and that will speed up everything else in your program because you will have less sorting and filtering to do.
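
    For illustration only (this is not the benchmark code), the 16-bit packing amounts to something like this, using the "v" template for unsigned 16-bit ints:

    my @positions = ( 3, 17, 42, 65000 );          # must all fit in 16 bits
    my $packed    = pack   "v*", @positions;       # 2 bytes per position
    my @restored  = unpack "v*", $packed;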

    Here is the benchmark code so that you can vary the numbers and do some runs yourself (assuming you have/can install Inline C):


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi BrowserUk,

      Interestingly, Perl's innate "w" pack template integer compression (as suggested by pc88mxer) not only compresses the data OK, but it does not seem to do anything much to the speed (though my testing has not been extensive). See Re^5: Byte allign compression in Perl.. and the related thread.
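
      For illustration, the "w" template round-trip is just:

      my $compressed = pack   "w*", 1 .. 20, 300_000;   # small ints take 1 byte each
      my @ints       = unpack "w*", $compressed;        # BER decode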

        Hi Guys,

        It seems that the straight unpack is much faster, as the benchmarks below show...

        Decode and encode a posting list with 250000 doc IDs and 3 words per doc..

        Hardware: Pentium 4 CPU 2.8 GHz, 1.5 GB RAM

        1. First version with the substr:
        pack time: 8s
        unpack time: A LOT.. I couldn't wait..

        2. Second version with the I template:
        pack time: 8s
        unpack time: 2s

        3. Third version with the w template:
        pack time: 3s
        unpack time: 3s

        It seems that we have some improvement thanks to you guys... but still I am not satisfied



        I will add the Elias gamma and the byte-aligned code that I implemented to the benchmarks, but I have to find a way to distinguish the doc IDs from the relative positions (one possible approach is sketched below)...
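
        One possible approach (a sketch of the idea with hypothetical helpers, not benchmarked) is to prefix each doc ID with a count of its positions, so a single "w" stream can be split apart again:

        sub encode_postings {
            my ($postings) = @_;                    # { doc_id => [ positions ] }
            my @ints;
            for my $doc_id ( sort { $a <=> $b } keys %$postings ) {
                my $pos = $postings->{$doc_id};
                push @ints, $doc_id, scalar @$pos, @$pos;   # id, count, positions
            }
            return pack "w*", @ints;
        }

        sub decode_postings {
            my @ints = unpack "w*", shift;
            my %postings;
            while (@ints) {
                my ( $doc_id, $n ) = splice @ints, 0, 2;
                $postings{$doc_id} = [ splice @ints, 0, $n ];
            }
            return \%postings;
        }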

        Now let's talk about the schema that I use and the "useless" information that I handle in the case where we have an AND query..

        Let's assume that the user gives the query "Perl pack unpack".

        First I use the lexicon to translate the terms into the corresponding word IDs.. Furthermore, in the lexicon I keep the DF (document frequency) for each term, so I can do an efficient intersection by starting from the word with the smallest DF (which currently I don't :( )..
        For this example the word "unpack" has a DF of 25000 docs and "pack" a DF of 50000. Now, as in the IR book, you can do efficient intersections via embedded skip pointers. This was the reason why I chose this schema, but the problem is that I don't use it properly so far. To do it as I intended, I have to use the filesystem to save my indexes and very low-level commands like read. I have to read a lot about Perl file I/O to see how I can point to the part of the file where the info for the asked term lives, and how it can be as fast as DBMS indexes.
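
        For what it's worth, a minimal sketch of that smallest-DF-first intersection on plain sorted doc-ID lists (a hypothetical helper, without skip pointers):

        sub intersect {
            my ( $short, $long ) = @_;              # both sorted ascending
            my @hits;
            my $j = 0;
            for my $id (@$short) {                  # walk the smaller-DF list
                $j++ while $j < @$long && $long->[$j] < $id;   # catch up in the long list
                last if $j >= @$long;
                push @hits, $id if $long->[$j] == $id;
            }
            return @hits;
        }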

        Another reason why I didn't choose the schema you proposed is that the application I am building is an indexing and searching tool where the user can index whatever he wants by crawling the web or his machine.. so I can't keep every index that the user builds in RAM; that's why I am trying to find an efficient index schema which will be fast with no system optimizations..

        It's a very, very big discussion, and it will be a pleasure for me if you guys continue to comment on this thread..

        P.S. One change that I am thinking of making in my schema is to replace the positions with the TF of each doc and save the positions in another table.. I have to think about it more before starting to implement it....

Re: Byte allign compression in Perl..
by pc88mxer (Vicar) on Apr 11, 2008 at 16:37 UTC
    It looks like this encoding method is very similar to the BER compressed integer format, except that 0 instead of 1 is used to signal continuation into the next byte. So the following should work:
    sub byte_align_decode {
        my $bytes = shift;
        $bytes ^= "\x80" x length($bytes);   # flip every MSB to match BER's continuation convention
        return unpack( "w*", $bytes );
    }

    # from the example in the above URL
    print join( ' ', byte_align_decode("\x06\xb8\x85\x0d\x0c\xb1") ), "\n";
    # emits: 824 5 214577

    The encoding routine is similar:

    sub byte_align_encode {
        my $bytes = pack( "w*", @_ );
        $bytes ^= "\x80" x length($bytes);   # flip every MSB back to the byte-aligned convention
        return $bytes;
    }
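
    A quick round-trip check (using the example values from above):

    my $enc = byte_align_encode( 824, 5, 214577 );
    print join( ' ', byte_align_decode($enc) ), "\n";   # emits: 824 5 214577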