Using perls inate integer compression pack template "w" we can also compress without hurting our speed much. This gives you a compression that depends on just how many small ints follow the doc_id. Using this compression a ~4 byte doc_id range will take 5 bytes but a pos of <128 will take 1 byte, 128^2 (16384) will take 2 bytes, 128^3 (2097152) will take 3 bytes etc.

In this scheme we add a number (still effectively the MSB) to the doc_id so any number we get back over this is a doc_id and any number under this is a pos. We let perl (in C) handle the tricky details. Using your dataset the compression is 50% but it does much better than this if you have quite a lot of small pos numbers per doc_id as they don't take much space. For example 1 doc_id and 11 pos ints < 128 take the same space as 1 doc_id and 3 pos ints < 2^31 in the old schema.

Please note this caveat. We do no error checking when serialising as it is expensive, however if you have a pos that is > MSB it will be interpretted as a doc_id. Also doc_id = 0 is not supported due to the doc_id > MSB test. Also because integers are 4 bytes if we have a doc_id > 2^31 when we add 2^31 to it we will hit the issue of 2^32 => 0 due to overflow wrap as this number is 1."0"x32 in binary. Thus is we have a doc_id > MSB it will get misinterpreted as a pos. This is also not checked due to 1) expense and 2) this code can't deal with it (it might work if you compile perl to support 64 bit ints but I can't be bothered to check).

my $MSB = 1<<31; my $tests = 250_000; my $DEBUG =1; print "Doing $tests tests\n"; my $pack = time(); my $str = ''; for my $doc_id( 1 .. $tests ) { #printf "Doc id: %d\n", $doc_id if $DEBUG; $str .= pack "w", $doc_id+$MSB; for my $pos ( 0 .. 2 ) { $str .= pack "w", $pos; } } printf "pack time %d\n", time()-$pack; printf "length %d\n", length $str; my $unpack = time(); my $dat = {}; my $doc_id = undef; for my $int (unpack "w*", $str) { if ( $int > $MSB ) { $doc_id = $int - $MSB; #printf "\nDoc id: %d\t", $doc_id if $DEBUG; } else { push @{$dat->{$doc_id}}, $int; #print "$int\t" if $DEBUG; } } printf "\n\nunpack time %d\n",time()-$unpack; __DATA__ C:\>serialize.pl Doing 250000 tests pack time 2 length 2000000 unpack time 2 C:\>

In reply to Re^5: Byte allign compression in Perl.. by tachyon-II
in thread Byte allign compression in Perl.. by MimisIVI

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.