in reply to Byte allign compression in Perl..

MimisIVI, I coded the Elias Gamma algorithm in C (and posted a correction to the Wikipedia page), and compared its compression levels and performance with using split & join on the ASCII, and pack & unpack on the binary. Here are the results:

c:\test>678848
Run with 15000 unique words in 5000 documents (Ave: 554 words/doc)

ASCII uses 26235503 bytes
Binary uses 16398152 bytes
Elias uses 22057989 bytes
1 trial of Packing ascii (163.453s total)
1 trial of Unpacking ascii (14.564s total)
1 trial of Packing binary (222.252s total)
1 trial of Unpacking binary (7.203s total)
1 trial of Packing Elias (337.402s total)
1 trial of Unpacking Elias (14.141s total)

As you can see, Elias Gamma, coded in C, runs slower than ASCII (a lot slower for compression) and doesn't achieve as good compression as packing binary(*). And remember, Elias Gamma should achieve better compression than the byte-aligned Büttcher algorithm.

Now, my C implementation may not be the quickest possible, though I have done the obvious things to maximise its performance, but it is at least 5 times faster than anything I could code in Perl.

There really is no way that either algorithm is going to have any major effect upon the performance of your algorithm. In Perl or C, any performance gained in read & transmission times is completely negated by the time spent packing and unpacking. I cannot recommend strongly enough the logic of using a better schema, and so only reading & transferring the data you need, rather than compressing and transferring roughly 25x more data than you need: 4% vs 84%.

(*) I'm compressing the binary using a 16-bit ('v') template, which allows for documents up to 64K in length. Elias does allow you to safely compress larger numbers, whereas the code below would break. But as tachyon-II pointed out elsewhere, once you start using Elias Gamma on numbers larger than 16 bits, the data gets larger, not smaller. There really is no mileage in using compression for this when redefining the schema can reduce your data transfers to 4%, and that will speed up everything else in your program because you will have less sorting and filtering to do.
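(As a quick sanity check of that claim, here is a back-of-the-envelope sketch, not part of the benchmark below: an Elias gamma code for n costs 2*(bit length of n) - 1 bits, so anything needing 17 or more bits costs at least 33 bits gamma-encoded, i.e. more than a fixed 32-bit 'V' field, never mind a 16-bit 'v'.)

for my $n ( 255, 256, 65_535, 65_536, 2**24 ) {
    my $bitLen    = length sprintf( '%b', $n );   # bits needed to represent $n
    my $gammaBits = 2 * $bitLen - 1;              # cost of its Elias gamma code
    printf "%10d: %2d-bit value -> %2d bits as Elias gamma\n",
        $n, $bitLen, $gammaBits;
}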

Here is the benchmark code so that you can vary the numbers and do some runs yourself (assuming you have/can install Inline C):

#! perl -slw
use strict;
use Inline C => 'DATA', NAME => '_678848', CLEAN_AFTER_BUILD => 0;
use Benchmark::Timer;
my $T = new Benchmark::Timer;
use List::Util qw[ sum ];
use Math::Random::MT qw[ rand srand ];
use Data::Dump qw[ pp ];
$Data::Dump::MAX_WIDTH = 1000;

$|++;
srand 1;

use constant NO_UNIQUE_WORDS => 15000;
use constant NO_OF_DOCS      => 5000;
use constant WORDS_PER_DOC   => 554;

## Build some test data
my @words = do{ local @ARGV = 'words'; <> };
close ARGV;
chomp @words;
@words = @words[ map{ int( rand scalar @words ) } 1 .. NO_UNIQUE_WORDS ];

printf "Run with %s unique words in %s documents (Ave: %s words/doc)\n\n",
    NO_UNIQUE_WORDS, NO_OF_DOCS, WORDS_PER_DOC;

my %index;
for my $docId ( 1 .. NO_OF_DOCS ) {
    my $pos = 0;
    my @doc = @words[ map{ int rand scalar @words } 1 .. WORDS_PER_DOC ];
    for my $w ( 0 .. $#doc ) {
        my $word = $doc[ $w ];
        push @{ $index{ $word }{ $docId } }, $pos;
        $pos += 1 + length $word;
    }
}
#pp \%index; <>;

## Building the posting lists in ASCII
$T->start( my $label = 'Packing ascii' );
my @postings = map {
    my $wordHash = $_;
    join ' ', map{
        "$_:" . join ';', @{ $wordHash->{ $_ } };
    } keys %{ $wordHash };
} values %index;
$T->stop( $label );
#pp \@postings; <>;

printf "ASCII uses %d bytes\n", sum map length, @postings;

## Unpack the ASCII
$T->start( $label = 'Unpacking ascii' );
my $wordIndex = 0;
for my $wordData ( @postings ) {
#    printf "\n$words[ $wordIndex ++ ]: ";
    for my $docData ( split ' ', $wordData ) {
        my( $doc, @posns ) = split '[:;]', $docData;
#        printf "$doc:%s ", join ';', @posns;
    }
}
$T->stop( $label );

undef @postings;

## Doing them in binary
$T->start( $label = 'Packing binary' );
my @binPostings = map {
    my $wordHash = $_;
    pack '(v/a*)*', map {
        pack 'v*', $_, @{ $wordHash->{ $_ } };
    } keys %{ $wordHash };
} values %index;
$T->stop( $label );

printf "Binary uses %d bytes\n", sum map length, @binPostings;

## Unpack the packed binary
$T->start( $label = 'Unpacking binary' );
$wordIndex = 0;
for my $wordData ( @binPostings ) {
#    printf "\n$words[ $wordIndex ++ ]: ";
    for my $docData ( unpack '(v/a*)*', $wordData ) {
        my( $doc, @posns ) = unpack 'v*', $docData;
#        printf "$doc:%s ", join ';', @posns;
    }
}
$T->stop( $label );

undef @binPostings;

## Elias Gamma Encoding
$T->start( $label = 'Packing Elias' );
my @packBinPostings = map {
    my $wordHash = $_;
    pack '(v/a*)*', map {
        packEm( $_, @{ $wordHash->{ $_ } } );
    } keys %{ $wordHash };
} values %index;
$T->stop( $label );

printf "Elias uses %d bytes\n", sum map length, @packBinPostings;

## Unpack the packed binary
$T->start( $label = 'Unpacking Elias' );
$wordIndex = 0;
for my $wordData ( @packBinPostings ) {
#    printf "\n$words[ $wordIndex ++ ]: ";
    for my $docData ( unpack '(v/a*)*', $wordData ) {
        my( $doc, @posns ) = unpackEm( $docData );
#        printf "$doc:%s ", join ';', @posns;
    }
}
$T->stop( $label );

$T->report;

__DATA__
c:\test>678848
Run with 15000 unique words in 5000 documents (Ave: 554 words/doc)

ASCII uses 26235503 bytes
Binary uses 16398152 bytes
Elias uses 22057989 bytes
1 trial of Packing ascii (163.453s total)
1 trial of Unpacking ASCII (14.564s total)
1 trial of Packing binary (222.252s total)
1 trial of Unpacking binary (7.203s total)
1 trial of Packing Elias (337.402s total)
1 trial of Unpacking Elias (14.141s total)

c:\test>678848
Run with 15000 unique words in 5000 documents (Ave: 554 words/doc)

ASCII uses 26237329 bytes
Binary uses 16396556 bytes
Elias uses 22065316 bytes
1 trial of Packing ascii (167.313s total)
1 trial of Unpacking ASCII (14.782s total)
1 trial of Packing binary (247.297s total)
1 trial of Unpacking binary (7.438s total)
1 trial of Packing Elias (378.662s total)
1 trial of Unpacking Elias (14.906s total)

__C__
#include <math.h>

#define XS_Vars      dXSARGS
#define XS_Items     items
#define XS_Item(x)   ST(x)
#define XS_Reset     sp = mark
#define XS_Push(x)   XPUSHs(x)
#define XS_Done      PUTBACK
#define XS_Return(x) XSRETURN(x)
#define XS_Void      XSRETURN(0)

#define TRACE printf( "%s:%d\n", __FILE__, __LINE__ );

_inline double log2( double n ) {
    return log( n ) / log( 2 );
}

_inline void setBit( register U8 *bs, U32 offset ) {
    *( bs + ( offset / 8 ) ) |= (U8)( 1 << ( offset % 8 ) );
}

_inline U8 getBit( register U8 *bs, U32 offset ) {
    return *( bs + ( offset / 8 ) ) & (U8)( 1 << ( offset % 8 ) ) ? 1 : 0;
}

void packEm ( SV *dummy, ... ) {
    XS_Vars;
    U32 in = 0, out = 0, i = 0;
    register U8 *packed;

    Newz( 0xfaece5, packed, XS_Items*4, U8 );

    for( in = 0; in < (U32)XS_Items; in++ ) {
        U32 num = SvUV( XS_Item( in ) );
        U32 len = (U32)log2( num );
        out += len;
        setBit( packed, out++ );
        for( i = 0; i < len; i++ ) {
            if( num & ( 1 << i ) ) setBit( packed, out );
            out++;
        }
    }
    out = ( out + 8 ) / 8;

    XS_Reset;
    XS_Push( sv_2mortal( newSVpvn( packed, out ) ) );
    Safefree( packed );
    XS_Done;
}

void unpackEm( SV * svPacked ) {
    XS_Vars;
    U32 bits = SvCUR( svPacked ) * 8;
    register U8 *packed = SvPVX( svPacked );
    U32 in = 0, i = 0;

    XS_Reset;

    while( in < bits ) {
        U32 len = 0;
        U32 num = 0;
        while( in < bits && !getBit( packed, in++ ) ) len++;
        if( in == bits ) break;
        for( i = 0; i < len; i++ ) {
            if( getBit( packed, in++ ) ) num |= ( 1 << i );
        }
        num |= ( 1 << len );
        XS_Push( sv_2mortal( newSVuv( num ) ) );
    }
    XS_Done;
    return;
}
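(For anyone who cannot install Inline::C, here is a minimal pure-Perl sketch of the same idea, added for illustration: the packEmPP/unpackEmPP names are invented, and it uses the classic MSB-first Elias gamma layout, so its output is not byte-for-byte compatible with the C packEm/unpackEm above.)

sub packEmPP {
    my $bits = '';
    for my $num ( @_ ) {                              # every value must be >= 1
        my $b = sprintf '%b', $num;                   # e.g. 8 -> '1000'
        $bits .= '0' x ( length( $b ) - 1 ) . $b;     # 8 -> '0001000'
    }
    return pack 'B*', $bits;                          # zero-pads to a byte boundary
}

sub unpackEmPP {
    my $bits = unpack 'B*', shift;
    my @nums;
    my $pos = 0;
    while( $pos < length $bits ) {
        my $zeros = 0;
        $zeros++, $pos++
            while $pos < length( $bits ) && '0' eq substr( $bits, $pos, 1 );
        last if $pos >= length $bits;                 # ran into the zero padding
        push @nums, oct( '0b' . substr( $bits, $pos, $zeros + 1 ) );
        $pos += $zeros + 1;
    }
    return @nums;
}

# print join ',', unpackEmPP( packEmPP( 8, 1, 300 ) );    # 8,1,300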

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Byte allign compression in Perl..
by tachyon-II (Chaplain) on Apr 12, 2008 at 03:54 UTC

    Hi BrowserUk,

    Interestingly, Perl's innate "w" pack-template integer compression (as suggested by pc88mxer) not only compresses the data OK, but it does not seem to do anything much to the speed either (though my testing has been far from extensive). See Re^5: Byte allign compression in Perl.. and the related thread.
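    (A minimal illustration of the "w" (BER compressed integer) template, added for reference; the values below are made up: small numbers cost a single byte, larger ones grow a byte at a time, and the result stays byte aligned.)

    my @posns  = ( 3, 17, 300, 70_000 );
    my $packed = pack 'w*', @posns;          # 1 + 1 + 2 + 3 = 7 bytes here
    printf "%d values in %d bytes: %s\n",
        scalar @posns, length $packed, join ',', unpack 'w*', $packed;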

      Hi Guys,

      It seems that the straight unpack is much faster, as the benchmarks below show...

      Decode and encode Posting List with 250000 doc ids and 3 words per doc..

      Hardware: Pentium 4 CPU 2.8 GHz, 1.5 GB RAM

      1. First version, with substr:
      pack time: 8s
      unpack time: a LOT.. I couldn't wait..

      2. Second version, with the 'I' template:
      pack time: 8s
      unpack time: 2s

      3. Third version, with the 'w' template:
      pack time: 3s
      unpack time: 3s

      It seems that we have some improvement thanks to you guys... but still I am not satisfied.
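      (A rough sketch of how the 'I'- and 'w'-template versions might be compared; the original scripts are not shown in this thread, so the templates and test data below are guesses.)

      use strict;
      use warnings;
      use Benchmark qw[ timethese ];

      my @docIds = map { int rand 250_000 } 1 .. 250_000;

      my $fixed = pack 'I*', @docIds;    # "I template": fixed-size native integers
      my $ber   = pack 'w*', @docIds;    # "w template": BER compressed integers

      printf "I: %d bytes  w: %d bytes\n", length $fixed, length $ber;

      timethese( 10, {
          'unpack I' => sub { my @ids = unpack 'I*', $fixed },
          'unpack w' => sub { my @ids = unpack 'w*', $ber   },
      } );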



      I will add to the benchmarks the Elias Gamma and the byte-aligned code that I implemented, but I have to find a way to distinguish the doc ids from the relative positions...

      Now let's talk about the schema that I use and the "useless" information that I handle in the case where we have an AND query..

      Let's assume that the user gives the query "Perl pack unpack".

      First I use the lexicon to translate the terms into the corresponding word ids.. Furthermore, in the lexicon I keep the DF (document frequency) for each term, so I can do an efficient intersection by starting from the word with the smallest DF (which currently I don't :( )..
      For this example, the word unpack has a DF of 25000 docs and pack a DF of 50000. Now, as the IR book describes, you can do efficient intersections via embedded skip pointers. That was the reason I chose this schema, but the problem is that I haven't used it properly so far. To do so, I am supposed to use the filesystem to save my indexes and very low-level commands like read. I have to read a lot about Perl file I/O to see how I can point to the part of the file where the info for the requested term lives, and how that can be as fast as DBMS indexes.
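      (A small in-memory sketch of that smallest-DF-first intersection, added for illustration; %$postings and %$df are hypothetical structures mapping a term to its doc-id list and its document frequency.)

      sub intersect_terms {
          my( $postings, $df, @terms ) = @_;

          # process the rarest term first so the candidate set shrinks fastest
          my @byDf = sort { $df->{ $a } <=> $df->{ $b } } @terms;

          my %candidates = map { $_ => 1 } @{ $postings->{ shift @byDf } };
          for my $term ( @byDf ) {
              my %here = map { $_ => 1 } @{ $postings->{ $term } };
              delete @candidates{ grep { !$here{ $_ } } keys %candidates };
              last unless %candidates;
          }
          return sort { $a <=> $b } keys %candidates;
      }

      # my @docs = intersect_terms( \%postings, \%df, qw[ Perl pack unpack ] );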

      Another reason why I didn't choose the schema you proposed is that the application I am building is an indexing and searching tool where the user can index whatever he wants by crawling the web or his machine.. so I can't keep every index that the user builds in RAM. That's why I am trying to find an efficient index schema which will be fast with no system optimizations..

      It's a very, very big discussion, and it would be a pleasure for me if you guys continue to comment on this thread..

      P.S. One change that I am thinking of making to my schema is to replace the positions with the TF of each doc and to save the positions in another table.. I have to think about it more before starting to implement it....

        I really think you need to drop your attachment to Elias gamma encoding. Even coded in C it is slower than either binary v/a or BER, AND it uses more space. While the space issue is data dependent, you do get byte-aligned data with either binary v/a or BER. Elias is not byte aligned, so it is more difficult to index, as the index must be a byte plus a 0-7 bit offset. Not only is it likely to be slower and take more space (as shown by our testing), it is also harder to use. Lose. Lose. Lose.

        Actually, coding a flag with Elias is trivial. The number of prepended zeros is 1 less than the number of significant digits that follow, so 8 (1000) is encoded as 0001000. All we need to do is encode it as 10001000. When we read the bitstring, if the first digit is a 1 we know it is a doc_id; if it is a zero, it is a pos. The number of prepended zeros is not affected, so we read them in the usual way to know how many significant binary digits to read. We end up at the correct boundary bit and just rinse and repeat.
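        (A bit-string sketch of that flagged encoding, added for illustration and working on '0'/'1' character strings rather than real packed bits. It assumes every position's gamma code starts with a zero, i.e. positions >= 2; a real implementation would store pos+2 or similar, since 1 gamma-encodes to a bare '1', which would collide with the doc_id flag, and 0 has no gamma code at all.)

        sub gamma_bits {
            my $b = sprintf '%b', shift;              # 8 -> '1000'
            return '0' x ( length( $b ) - 1 ) . $b;   # 8 -> '0001000'
        }

        sub encode_postings {
            my $byDoc = shift;                        # { docId => [ positions ], ... }
            my $bits  = '';
            for my $docId ( sort { $a <=> $b } keys %$byDoc ) {
                $bits .= '1' . gamma_bits( $docId );  # leading 1 flags a doc_id
                $bits .= gamma_bits( $_ ) for @{ $byDoc->{ $docId } };
            }
            return $bits;
        }

        sub decode_postings {
            my $bits = shift;                         # must start with a flagged doc_id
            my( %byDoc, $docId );
            while( length $bits ) {
                my $isDoc = '1' eq substr( $bits, 0, 1 );
                substr( $bits, 0, 1 ) = '' if $isDoc;            # drop the flag bit
                my( $zeros ) = $bits =~ /^(0*)/;
                my $len      = length( $zeros ) + 1;             # marker bit + mantissa
                my $num      = oct( '0b' . substr( $bits, length( $zeros ), $len ) );
                substr( $bits, 0, length( $zeros ) + $len ) = '';
                if( $isDoc ) { $docId = $num; $byDoc{ $docId } = [] }
                else         { push @{ $byDoc{ $docId } }, $num  }
            }
            return \%byDoc;
        }

        # decode_postings( encode_postings( { 8 => [ 3, 12 ] } ) )   # -> { 8 => [ 3, 12 ] }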

        trying to find an efficient index schema which will be fast with no system optimizations..

        Dream on. It sounds as though you want to build a Google and a Google desktop search with a single mid-spec machine and without doing it in the fastest language possible (C/C++).

        It would be helpful if you could outline what it is you are trying to do (search of some sort) and what you hope to achieve.

        What you don't seem to understand about the whole compression/disk transfer thing is this: the process of indexing is incremental, i.e. you can't do it all at once in memory and then dump the results. For each word you have a series of doc_id and pos data, but how do you add to that? Any way you wish to cut it, once it is on disk it exists as a linear stream of bytes in one or more files. To add data to any given file requires you to rewrite the entire content of the file (at least the part past the insertion point). The only way to avoid endless file rewrites is to have a chunk of empty space effectively pre-allocated for expected new data. This efficiency strategy is used throughout IT and basically trades space for speed. It is done in memory by programming languages like Perl (which, for example, automatically allocates more memory to a string than you need, just in case you want to add data to it, thus avoiding an expensive call to malloc and a data rewrite) and by the OS, which typically leaves a bit of spare space at the end of a file in case you append to it.
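        (A toy sketch of that space-for-speed trade, added for illustration; the fixed 4KB slot, the 4-byte length word and the slot numbering are all invented, not an actual index format.)

        use Fcntl qw[ O_RDWR O_CREAT SEEK_SET ];

        use constant SLOT_SIZE => 4096;    # space pre-allocated per word's posting record

        sub append_to_slot {
            my( $file, $slotNo, $newBytes ) = @_;

            sysopen my $fh, $file, O_RDWR | O_CREAT or die "$file: $!";
            binmode $fh;

            my $offset = $slotNo * SLOT_SIZE;
            sysseek $fh, $offset, SEEK_SET;
            my $got  = sysread( $fh, my $slot, SLOT_SIZE ) || 0;
            my $used = $got >= 4 ? unpack 'V', substr( $slot, 0, 4 ) : 0;

            die "slot $slotNo full: needs relocating\n"
                if 4 + $used + length( $newBytes ) > SLOT_SIZE;

            # rewrite only this slot's length word and append the new bytes in place
            sysseek  $fh, $offset, SEEK_SET;
            syswrite $fh, pack( 'V', $used + length $newBytes );
            sysseek  $fh, $offset + 4 + $used, SEEK_SET;
            syswrite $fh, $newBytes;
            close $fh;
        }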

        Before getting carried away, I think you need to develop a test application that can do everything you want, indexing-wise, in memory on a small representative dataset. Then you can see if your ideas actually work. If it works as you desire, then look at how to scale that so it can be done on disk with incremental data addition.