Using perls inate integer compression pack template "w" we can also compress without hurting our speed much. This gives you a compression that depends on just how many small ints follow the doc_id. Using this compression a ~4 byte doc_id range will take 5 bytes but a pos of <128 will take 1 byte, 128^2 (16384) will take 2 bytes, 128^3 (2097152) will take 3 bytes etc.
In this scheme we add a number (still effectively the MSB) to the doc_id so any number we get back over this is a doc_id and any number under this is a pos. We let perl (in C) handle the tricky details. Using your dataset the compression is 50% but it does much better than this if you have quite a lot of small pos numbers per doc_id as they don't take much space. For example 1 doc_id and 11 pos ints < 128 take the same space as 1 doc_id and 3 pos ints < 2^31 in the old schema.
Please note this caveat. We do no error checking when serialising as it is expensive, however if you have a pos that is > MSB it will be interpretted as a doc_id. Also doc_id = 0 is not supported due to the doc_id > MSB test. Also because integers are 4 bytes if we have a doc_id > 2^31 when we add 2^31 to it we will hit the issue of 2^32 => 0 due to overflow wrap as this number is 1."0"x32 in binary. Thus is we have a doc_id > MSB it will get misinterpreted as a pos. This is also not checked due to 1) expense and 2) this code can't deal with it (it might work if you compile perl to support 64 bit ints but I can't be bothered to check).
my $MSB = 1<<31; my $tests = 250_000; my $DEBUG =1; print "Doing $tests tests\n"; my $pack = time(); my $str = ''; for my $doc_id( 1 .. $tests ) { #printf "Doc id: %d\n", $doc_id if $DEBUG; $str .= pack "w", $doc_id+$MSB; for my $pos ( 0 .. 2 ) { $str .= pack "w", $pos; } } printf "pack time %d\n", time()-$pack; printf "length %d\n", length $str; my $unpack = time(); my $dat = {}; my $doc_id = undef; for my $int (unpack "w*", $str) { if ( $int > $MSB ) { $doc_id = $int - $MSB; #printf "\nDoc id: %d\t", $doc_id if $DEBUG; } else { push @{$dat->{$doc_id}}, $int; #print "$int\t" if $DEBUG; } } printf "\n\nunpack time %d\n",time()-$unpack; __DATA__ C:\>serialize.pl Doing 250000 tests pack time 2 length 2000000 unpack time 2 C:\>
In reply to Re^5: Byte allign compression in Perl..
by tachyon-II
in thread Byte allign compression in Perl..
by MimisIVI
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |