comment on

Using perls inate integer compression pack template "w" we can also compress without hurting our speed much. This gives you a compression that depends on just how many small ints follow the doc_id. Using this compression a ~4 byte doc_id range will take 5 bytes but a pos of <128 will take 1 byte, 128^2 (16384) will take 2 bytes, 128^3 (2097152) will take 3 bytes etc.

In this scheme we add a number (still effectively the MSB) to the doc_id so any number we get back over this is a doc_id and any number under this is a pos. We let perl (in C) handle the tricky details. Using your dataset the compression is 50% but it does much better than this if you have quite a lot of small pos numbers per doc_id as they don't take much space. For example 1 doc_id and 11 pos ints < 128 take the same space as 1 doc_id and 3 pos ints < 2^31 in the old schema.

Please note this caveat. We do no error checking when serialising as it is expensive, however if you have a pos that is > MSB it will be interpretted as a doc_id. Also doc_id = 0 is not supported due to the doc_id > MSB test. Also because integers are 4 bytes if we have a doc_id > 2^31 when we add 2^31 to it we will hit the issue of 2^32 => 0 due to overflow wrap as this number is 1."0"x32 in binary. Thus is we have a doc_id > MSB it will get misinterpreted as a pos. This is also not checked due to 1) expense and 2) this code can't deal with it (it might work if you compile perl to support 64 bit ints but I can't be bothered to check).

my $MSB       = 1<<31;

my $tests = 250_000;
my $DEBUG =1;
print "Doing $tests tests\n";
my $pack = time();

my $str = '';
for my $doc_id( 1 .. $tests ) {
    #printf "Doc id: %d\n", $doc_id if $DEBUG;
    $str .= pack "w", $doc_id+$MSB;
    for my $pos ( 0 .. 2 ) {
        $str .= pack "w", $pos;
    }
}
printf "pack time %d\n", time()-$pack;
printf "length %d\n", length $str;
my $unpack = time();

my $dat = {};
my $doc_id = undef;
for my $int (unpack "w*", $str) {
    if ( $int > $MSB ) 
    {
        $doc_id = $int - $MSB;
        #printf "\nDoc id: %d\t", $doc_id if $DEBUG;
    }
    else 
    {
        push @{$dat->{$doc_id}}, $int;
        #print "$int\t" if $DEBUG;
    }
}

printf "\n\nunpack time %d\n",time()-$unpack;

__DATA__
C:\>serialize.pl
Doing 250000 tests
pack time 2
length 2000000
unpack time 2

C:\>
[download]

In reply to Re^5: Byte allign compression in Perl.. by tachyon-II
in thread Byte allign compression in Perl.. by MimisIVI

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.