in reply to Re^4: Byte allign compression in Perl..
in thread Byte allign compression in Perl..

Using Perl's innate integer compression, the pack template "w" (BER), we can also compress without hurting our speed much. This gives a compression that depends on just how many small ints follow the doc_id. Under this scheme a doc_id in the ~4-byte range takes 5 bytes, but a pos < 128 takes 1 byte, a pos < 128^2 (16,384) takes 2 bytes, a pos < 128^3 (2,097,152) takes 3 bytes, and so on.
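Those byte counts are easy to verify directly (a quick check of my own, not part of the original post):

```perl
# Show how many bytes BER ("w") encoding uses at each magnitude boundary.
for my $n ( 1, 127, 128, 16_383, 16_384, 2_097_151, 2_097_152 ) {
    printf "%10d -> %d byte(s)\n", $n, length pack "w", $n;
}
```

BER stores 7 payload bits per byte, which is why the boundaries fall at powers of 128 rather than 256.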

In this scheme we add a constant (still effectively an MSB flag) to the doc_id, so any number we get back above that constant is a doc_id and any number below it is a pos. We let Perl (in C) handle the tricky details. On your dataset the compression is 50%, but it does much better than that if you have many small pos numbers per doc_id, since they take little space. For example, 1 doc_id and 11 pos ints < 128 take the same space as 1 doc_id and 3 pos ints < 2^31 did in the old scheme (16 bytes either way).

Please note this caveat: we do no error checking when serialising, as it is expensive. If you have a pos > $MSB it will be interpreted as a doc_id. doc_id = 0 is also not supported, because of the $int > $MSB test. And because integers are 4 bytes, a doc_id > 2^31 will, when we add 2^31 to it, wrap past 2^32 => 0 (2^32 being 1 followed by 32 zeros in binary), so a doc_id > $MSB will get misinterpreted as a pos. This is also not checked, due to 1) expense and 2) this code can't deal with it anyway (it might work if you compile Perl to support 64-bit ints, but I can't be bothered to check).
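For anyone who does want the safety net, the two range checks the post skips are cheap to express. This is my own sketch, not part of the original code; the subroutine names are invented for illustration:

```perl
my $MSB = 1 << 31;

# A doc_id must satisfy 1 <= doc_id < $MSB so that doc_id + $MSB stays
# below 2**32 and remains distinguishable from a pos when unpacking.
sub pack_doc_id {
    my ($doc_id) = @_;
    die "doc_id out of range: $doc_id" if $doc_id < 1 or $doc_id >= $MSB;
    return pack "w", $doc_id + $MSB;
}

# A pos must satisfy 0 <= pos < $MSB or the decoder's $int > $MSB test
# would misread it as a doc_id.
sub pack_pos {
    my ($pos) = @_;
    die "pos out of range: $pos" if $pos < 0 or $pos >= $MSB;
    return pack "w", $pos;
}
```

Whether the extra branch per value is worth it depends on how much you trust the data source.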

my $MSB   = 1 << 31;
my $tests = 250_000;
my $DEBUG = 1;

print "Doing $tests tests\n";

my $pack = time();
my $str  = '';
for my $doc_id ( 1 .. $tests ) {
    #printf "Doc id: %d\n", $doc_id if $DEBUG;
    $str .= pack "w", $doc_id + $MSB;
    for my $pos ( 0 .. 2 ) {
        $str .= pack "w", $pos;
    }
}
printf "pack time %d\n", time() - $pack;
printf "length %d\n", length $str;

my $unpack = time();
my $dat    = {};
my $doc_id = undef;
for my $int ( unpack "w*", $str ) {
    if ( $int > $MSB ) {
        $doc_id = $int - $MSB;
        #printf "\nDoc id: %d\t", $doc_id if $DEBUG;
    }
    else {
        push @{ $dat->{$doc_id} }, $int;
        #print "$int\t" if $DEBUG;
    }
}
printf "\n\nunpack time %d\n", time() - $unpack;

__DATA__
C:\>serialize.pl
Doing 250000 tests
pack time 2
length 2000000

unpack time 2

C:\>

Replies are listed 'Best First'.
Re^6: Byte allign compression in Perl..
by BrowserUk (Patriarch) on Apr 12, 2008 at 04:37 UTC

    I added W-BER to my benchmark. It just about beats binary ('v*') on space; about the same for unpacking, but slower for packing.

    Any way you slice it, the single best option is to restructure the schema to cause less data to need retrieval...but then you know that :)

    ## Note this run uses 1000 documents rather than 5000.
    ## The relative performance remains the same though.

    Run with 15000 unique words in 1000 documents (Ave: 554 words/doc)
    ASCII  uses 4755001 bytes
    W-BER  uses 3196726 bytes
    Binary uses 3279128 bytes
    Elias  uses 4118678 bytes

    1 trial of Packing ascii    (9.078s total)
    1 trial of Unpacking ascii  (2.942s total)
    1 trial of Packing W-BER    (16.144s total)
    1 trial of Unpacking W-BER  (8.219s total)
    1 trial of Packing binary   (9.504s total)
    1 trial of Unpacking binary (8.223s total)
    1 trial of Packing Elias    (11.176s total)
    1 trial of Unpacking Elias  (9.750s total)
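    The space trade-off between W-BER and a fixed-width template is easy to reproduce for any list of ints. This is my own quick check, not the benchmark code; the sample values are invented, and I use 'v' (2-byte shorts) as the fixed format since that is what the reply names:

```perl
# Compare variable-width BER ("w") against fixed 2-byte shorts ("v").
my @ints  = ( 3, 17, 130, 4736, 5000, 65_000 );
my $ber   = pack "w*", @ints;    # 1-3 bytes per value for this range
my $fixed = pack "v*", @ints;    # always 2 bytes per value (max 65535)
printf "W-BER: %d bytes, 'v*': %d bytes\n", length $ber, length $fixed;

# Round-trip to confirm nothing was lost.
my @back = unpack "w*", $ber;
```

    Small values are where BER wins; values of 16,384 and up cost 3 bytes each, which is how it can lose to 'v*' when the distribution skews large.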

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      As far as I can see the intersection problem is O(n log n), which makes scaling it problematic. There appear to be special cases where this can be reduced to O(n), which would let you deal with an arbitrarily large data set in finite time, provided you have the horsepower.
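      One such special case, presumably, is when both posting lists are already sorted: a single merge pass then intersects them in O(n+m) rather than requiring a sort or per-element searches. A minimal sketch of mine, not code from the thread:

```perl
# Intersect two sorted posting lists of doc_ids in one linear pass.
sub intersect_sorted {
    my ( $x, $y ) = @_;
    my ( $i, $j, @out ) = ( 0, 0 );
    while ( $i < @$x and $j < @$y ) {
        if    ( $x->[$i] < $y->[$j] ) { $i++ }
        elsif ( $x->[$i] > $y->[$j] ) { $j++ }
        else  { push @out, $x->[ $i++ ]; $j++ }    # match in both lists
    }
    return \@out;
}
```

      Posting lists built in doc_id order come out sorted for free, which is what makes this case common in practice.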

      It was interesting to see how bad subst was in the code above compared to a straight unpack (~40x slower). Oh, for a world full of fun problems to solve. Anyway, back to the day-to-day drudgery ;-)

        As far as I can see the intersection problem is O(n log n), which makes scaling it problematic.

        How does 312ms sound for a two-word query in a DB containing 15,000 words in 5000 documents (ave: 554 words/doc; 2.7 million word-document pairs)?

        678848=# select
        678848-#     "wordFromId"( "word-id" ),
        678848-#     "docnameFromId"( "doc-id" ),
        678848-#     posns
        678848-# from "word-doc"
        678848-# where
        678848-#     "word-id" in (
        678848(#         "idFromWord"( 'aachen' ),
        678848(#         "idFromWord"( 'zwitterions' )
        678848(#     )
        678848-#     and
        678848-#     "doc-id" in (
        678848(#         select "docIdsContaining"( "idFromWord"( 'aachen' ) )
        678848(#         intersect
        678848(#         select "docIdsContaining"( "idFromWord"( 'zwitterions' ) )
        678848(#     )
        678848-# order by
        678848-#     "word-id", "doc-id"
        678848-# ;
         wordFromId  | docnameFromId | posns
        -------------+---------------+-------
         aachen      | document_223  |  1723
         aachen      | document_940  |  1903
         aachen      | document_1778 |   273
         aachen      | document_1897 |  3163
         aachen      | document_3990 |   817
         aachen      | document_4407 |  4736
         zwitterions | document_223  |  3072
         zwitterions | document_940  |  3504
         zwitterions | document_1778 |   664
         zwitterions | document_1897 |  4186
         zwitterions | document_3990 |   355
         zwitterions | document_4407 |  1459
        (12 rows)

        678848=# select count( * ) from "word-doc";
         count
        ---------
         2719282
        (1 row)
