in reply to Compress positive integers

Hi guys,

First of all thanks for your answers...


My mistake was that I didn't explain what an inverted index is...

For each term that occurs in the document collection
I am indexing, I save a string with some required info in a DBMS.
This info is the doc IDs in which the term occurs and the positions at which the term occurs in each document...



Inverted index entry for the word with ID == 1:
(this word appears in only two documents)

WORD ID ==> POSTING LIST
1 ==> "Doc1 <space> Pos1;Pos2;...;PosN; <space> Doc2 <space> Pos1;Pos2;...;PosN;"


As I told you before, the problem appears when I index a big
document collection (e.g. the whole Wikipedia). In that case
the strings (posting lists) in which I save the information for each term get very large,
and fetching them from disk is very expensive.
That's why I said the solution is to compress the posting lists efficiently...

The biggest problem is that I am not using the pack function, so I am working with scalar
values and not binary data...
The document collection is not standardised, and neither is the size
of each document, so I can't answer those questions...

A very common compression technique is the Elias gamma code (http://en.wikipedia.org/wiki/Elias_gamma_coding).
The Wiki page has code in Java, and I translated it into Perl as below...

sub GammaCode {
    my $n = $_[0];
    my $gammaCode;
    my $mystring;
    my $log = int(log2($n));
    return 1 if $n == 1;
    for (1 .. $log) {
        $gammaCode .= "1";
    }
    $gammaCode .= "0";
    $mystring = dec2bin($n);
    $mystring =~ s/^\w{1}//;
    $gammaCode .= $mystring;
    return $gammaCode;
}

sub log2 {
    my $n = shift;
    return log($n) / log(2);
}

sub dec2bin {
    my $str = unpack("B32", pack("N", shift));
    $str =~ s/^0+(?=\d)//;    # otherwise you'll get leading zeros
    return $str;
}

sub bin2dec {
    return unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}

my $number = 11;
print "The Gamma representation of $number is\t", GammaCode($number), "\n";

The above sub works correctly for the compression, but the problem is that I
don't know how to write the return value of the sub as binary... Furthermore, I
don't know how to decode this algorithm...
If you have any experience with the Information Extraction field
and a very good knowledge of the pack/unpack functions, please help...
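
For reference, a minimal sketch of storing such a '0'/'1' string as real bits with pack 'B*' (illustration only; it assumes the GammaCode sub above, and that the bit length is kept alongside the packed data, since the last byte gets zero-padded):

my $bits   = GammaCode(11) . GammaCode(5);   # "1110011" . "11001"
my $nbits  = length $bits;
my $packed = pack 'B*', $bits;               # 12 bits -> 2 bytes on disk

# Later, after fetching $packed and $nbits back from the DBMS:
my $restored = substr unpack('B*', $packed), 0, $nbits;
print "$bits\n$restored\n";                  # identical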

Thanks for your interest!!
Mimis

P.S. BrowserUk, you are right, I should have cited the source of
the byte-aligned algorithm... thanks for doing that for me.
It seems that you are familiar with this field;
I hope you can help me...

Re^2: Compress positive integers
by BrowserUk (Patriarch) on Apr 08, 2008 at 07:33 UTC

    You can save about 60% of the space by storing this data in binary. The following assumes that each of your documents can be up to 4GB (hence the 'N' pack format). (This is very tersely coded, just for a quick test):

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];
    use Math::Random::MT qw[ rand ];
    use Data::Dump qw[ pp ]; $Data::Dump::MAX_WIDTH = 1000;

    use constant NO_OF_DOCS => 10000;
    use constant DOC_PER_WORD => NO_OF_DOCS / 2;
    use constant OCCURANCES_PER_DOC => 20;

    ## Build some test data
    my @words = qw[ the quick brown fox jumps over a lazy dog twice ];
    my %index = map{ $_ => {} } @words;

    for my $word ( keys %index ) {
        for my $doc ( map{ int rand NO_OF_DOCS } 1 .. DOC_PER_WORD ) {
            my $docName = sprintf "Doc%05d", $doc;
            $index{ $word }{ $docName } = [
                map{ int rand 2**32 } 1 .. OCCURANCES_PER_DOC
            ]
        }
    }
    ##pp \%index;

    ## Building the posting lists in ASCII
    my @postings = map {
        my $wordHash = $_;
        join ' ', map{
            "$_ " . join ':', @{ $wordHash->{ $_ } };
        } keys %{ $wordHash };
    } values %index;
    ## pp \@postings;

    printf "ASCII uses %d bytes\n", sum map length, @postings;

    ## Doing them in binary
    my @binPostings = map {
        my $wordHash = $_;
        pack '(N/A*)*', map {
            pack 'C/A* A*', $_, pack 'N*', @{ $wordHash->{ $_ } };
        } keys %{ $wordHash };
    } values %index;

    printf "Binary uses %d bytes\n", sum map length, @binPostings;
    #pp \@binPostings;
    <>;

    ## Unpack the packed binary
    my $wordIndex = 0;
    for my $wordData ( @binPostings ) {
        print $words[ $wordIndex ++ ];
        for my $docData ( unpack '(N/A*)*', $wordData ) {
            my( $doc, $posns ) = unpack 'C/A* A*', $docData;
            print "\t$doc: ", join ', ', unpack 'N*', $posns;
        }
    }

    __END__
    ## Output for 10 words appearing 20 times in each of
    ## half of 10,000 4GB documents at random positions.
    C:\test>678848
    ASCII uses 8,801,314 bytes
    Binary uses 3,656,853 bytes

    But don't stop reading yet.

    If you are storing these records in an RDBMS, you are wasting the potential of that tool. Doing it this way (in either ASCII or binary), to find all the documents that contain 2 words, you are having to retrieve and unpack all the positions for both words in every document in which they appear, and then doing the union of the document sets in Perl.

    RDBMSs exist to do set algebra. Doing it in Perl, and having to retrieve and unpack all that data to do it, is silly. You'd be far better off changing your schema to allow the RDBMS to give you only those offsets for those documents that contain the word or words you require. But you'll need a DBA, or at least someone whose SQL skills are a lot more current than my own, to set you straight on that.

    A schema something like:

    TABLE words
        word-id MEDIUMINT primary key
        word    VARCHAR

    TABLE docs
        doc-id  INTEGER primary key
        docname VARCHAR

    TABLE word-doc
        word-id MEDIUMINT REFERENCES words( word-id ) } primary key
        doc-id  INTEGER   REFERENCES docs( doc-id )   }
        posns   MEDIUMBLOB

    This should be far more compact, and it allows you to issue queries that find (for example) all the documents that contain two (or more) given words and only return the offsets for those words in those documents. Far more of the work is done by the RDBMS, and far less data has to be read, transmitted and unpacked.
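
    As a rough illustration only (my own sketch; it assumes the table and column names above, a MySQL-style database called 'myindex', and real credentials), a two-word query against that schema might look something like this from Perl/DBI:

    use strict;
    use warnings;
    use DBI;

    # Sketch: find the docs containing both words and fetch only the
    # position blobs for those docs. Table/column names follow the
    # schema above; backticks because of the hyphenated identifiers.
    my $dbh = DBI->connect( 'dbi:mysql:database=myindex', 'user', 'pass',
                            { RaiseError => 1 } );

    my $sql = q{
        SELECT d.docname, wd1.posns, wd2.posns
        FROM   `word-doc` wd1
        JOIN   `word-doc` wd2 ON wd2.`doc-id` = wd1.`doc-id`
        JOIN   words w1       ON w1.`word-id` = wd1.`word-id`
        JOIN   words w2       ON w2.`word-id` = wd2.`word-id`
        JOIN   docs  d        ON d.`doc-id`   = wd1.`doc-id`
        WHERE  w1.word = ? AND w2.word = ?
    };

    for my $row ( @{ $dbh->selectall_arrayref( $sql, undef, 'gratis', 'twee' ) } ) {
        my( $doc, $posns1, $posns2 ) = @$row;
        # assuming posns is stored as packed 32-bit integers ('N*')
        print "$doc: ", join( ',', unpack 'N*', $posns1 ),
              " / ",    join( ',', unpack 'N*', $posns2 ), "\n";
    }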

    You could then go a step further and investigate bitstring fields and boolean algebra.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi again,

      I am afraid that you didn't understand the schema that I use...

      One more time...

      Document collection: 5000 documents
      Average size of each document (number of words): 554 words
      Number of distinct words that appear in the collection: 15000


      Now, for each word that appears in the collection (in this case 15000 words),
      I save a posting list in the DBMS, i.e. 15000 posting lists in total... like below...


      word="Gratis" --> word id=15
      column word Id ==> 15
      column Posting List ==> "1 2;3;4; 2 5;2; 3 1;15;......

      word="Twee" --> word id=10
      column word Id ==> 10
      column Posting List ==> "10 20;3 21 2; 43 100;105;......

      And so on for each of the 15000 words...

      So if the user gives the query -> "gratis twee"

      I will fetch the two posting lists above and merge them
      to find which documents contain both terms, and I will rank higher the docs in which
      the terms are near to each other, by checking their positions (proximity score).
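
      As an illustration only (my own sketch of that merge step, with made-up data, assuming the posting-list string format shown above):

      use strict;
      use warnings;
      use List::Util qw( min );

      # Sketch: intersect two posting lists of the form
      # "docid pos;pos;... docid pos;pos;..." and compute a crude
      # proximity score (smaller minimum distance => better).
      sub parse {
          my %by_doc;
          my @fields = split ' ', shift;
          while ( my ($doc, $pos) = splice @fields, 0, 2 ) {
              $by_doc{$doc} = [ grep { length } split /;/, $pos ];
          }
          return \%by_doc;
      }

      sub merge_and_score {
          my ($list_a, $list_b) = map { parse($_) } @_;
          my %score;
          for my $doc ( grep { exists $list_b->{$_} } keys %$list_a ) {
              my @dist;
              for my $pa ( @{ $list_a->{$doc} } ) {
                  push @dist, abs( $pa - $_ ) for @{ $list_b->{$doc} };
              }
              $score{$doc} = min @dist;
          }
          return \%score;
      }

      my $scores = merge_and_score( "1 2;3;4; 2 5;2; 3 1;15;",
                                    "10 20;3; 2 4; 3 100;105;" );
      print "doc $_ : min distance $scores->{$_}\n"
          for sort { $scores->{$a} <=> $scores->{$b} } keys %$scores;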


      BrowserUk, I am afraid that your solution isn't what I am actually looking for...
      but anyway, thanks for your time...


      As I wrote in a previous post, I found a way to encode
      positive integers, no matter how big they are, by
      using the Elias gamma code...

      An example of the code is like below...

      Number (ASCII) --> Gamma representation (bits)
      1 = 1
      2 = 010
      3 = 011
      4 = 00100
      etc.


      My first question is: how can I store the string below as a bit string...

      posting list ==> "1110001110101011111101101111011"

      And my last question: when I fetch this string back, how
      will I unpack it and read it one bit at a time?

      The decode process works like this (a small Perl sketch follows after the example):

      1. Read and count 1s from the stream until you reach the first 0. Call this count of ones N.
      2. Read the next N bits of the stream and convert them to a decimal number
         (e.g. 101 = 5).
      3. To decode the number, add 2 to the power N to the number from step 2.


      So the first number is 9 = 1110001.
      The second is 6 = 11010... and so on.
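
      A minimal sketch of that decode process (illustration only; it assumes the 1s-then-0 convention described above and a bit string written to disk with pack 'B*'):

      use strict;
      use warnings;

      # Sketch: decode a stream of gamma codes of the "count 1s until the
      # first 0, then read that many bits" form described above.
      # $packed is a bit string written with pack 'B*'; $nbits is the number
      # of meaningful bits (the rest of the last byte is zero padding).
      sub gamma_decode {
          my ($packed, $nbits) = @_;
          my $bits = substr unpack('B*', $packed), 0, $nbits;
          my ($i, @numbers) = (0);
          while ( $i < length $bits ) {
              my $n = 0;
              $n++, $i++ while substr($bits, $i, 1) eq '1';            # step 1
              $i++;                                                    # skip the 0
              my $rest = $n ? oct( '0b' . substr($bits, $i, $n) ) : 0; # step 2
              $i += $n;
              push @numbers, 2**$n + $rest;                            # step 3
          }
          return @numbers;
      }

      my $bits   = "1110001110101011111101101111011";
      my $packed = pack 'B*', $bits;                   # 31 bits -> 4 bytes
      print join( ' ', gamma_decode($packed, length $bits) ), "\n";   # 9 6 3 59 7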

      I hope you can help me ....

      Regards
      Mimis

        You don't seem to be listening. What you are trying to do will never work at scale. You need to do all the heavy lifting on the RDBMS side. The heart of the problem is this:

        column Posting List ==> "10 20;3 21 2; 43 100;105;......

        The core problem with this format is that you MUST pull it back into perl to manipulate it. The more documents you have and the more words per document the more data you need to pull to process your query. This will never scale. Not today, not tomorrow, not never. Compression will only vaguely paper over the cracks, if it helps at all (because of the overhead price).

        The table format suggested by BrowserUk is not optimal, but it is along the lines of what you want, as it indicates the need for 3 basic tables. For best speed you want to avoid BLOB and VARCHAR, i.e. anything that is not fixed width, as this makes indexing much harder, and it is the indexing that is going to provide the speed/scalability, in the same way a hash allows you to have fast "random" access to a large array.

        The basic unit of data you want to store consists of

        WORD IN_DOCUMENT AT_POSITION

        To efficiently store this you will want to store this unit of data in a compressed form. The compressed form you really want is one where the WORD becomes a WORD_ID and the IN_DOCUMENT becomes a DOC_ID. If you think about a search you want to input words to search for, thus the words table needs to rapidly convert a WORD to a WORD_ID. Using this we will look in our basic unit data table and get back some DOC_IDs which we then need to convert back to IN_DOCUMENT. Thus the tables you need are:

        TABLE words
            word    CHAR(32) PRIMARY KEY
            word_id INT

        TABLE docs
            doc_id   INT PRIMARY KEY
            doc_name VARCHAR

        TABLE word_doc
            word_id INT
            doc_id  INT
            pos     INT

        To complete this we must create an index on the word_doc table. This table as it stands is just a big array. When we create an index on it, it effectively becomes a big hash that we can get rapid random access to:

        CREATE INDEX warpspeed ON word_doc (word_id)

        With this structure in place you can then make a query into the word_doc table that will return the data you want. The basic query will be:

        SELECT doc_id FROM word_doc WHERE word_id = ?
        INTERSECT
        SELECT doc_id FROM word_doc WHERE word_id = ?

        This will work but will still be slow. The problem is that it is a big task and we can't use the index to go straight to the data we want. The solution to this is to predict and cache.

        The problem with the Elias gamma encoding is that even if you implement the bit-twiddling in C, because the fields span byte boundaries, the process of compressing and decompressing your posting lists is slower than unpacking your ASCII representation using pack/unpack.

        So, whilst you might achieve a few percent more compression than with the simple binary packing I showed elsewhere, the time saved in reading and transmitting that slightly smaller volume of data, is negated by the time it takes to decompress it.

        The algorithm you posted in your OP, which is taken from the Buettcher paper, seeks to address that problem by only packing to whole-byte boundaries. The problem with it is that the code shown in the paper for compression and decompression is uselessly incomplete. E.g.

        Compression code given:

        int outPos = 0, previous = 0;
        for (int inPos = 0; inPos < n; inPos++) {
            int delta = uncompressed[inPos] - previous;
            while (delta >= 128) {
                compressed[outPos++] = (delta & 127) | 128;
                delta = delta >> 7;
            }
            compressed[outPos++] = delta;
        }

        Note: n in the for loop is never set. This can be derived from the length of the input. However, notice that previous is initialised to 0, and subtracted from each input number, but it is never set to any value other than zero!

        Decompression code given:

        int outPos = 0, previous = 0;
        for (int outPos = 0; outPos < n; outPos++) {
            for (int shift = 0; ; shift += 7) {
                int temp = compressed[inPos++];
                previous += ((temp & 127) << shift);
                if (temp < 128) break;
            }
            uncompressed[outPos] = previous;
        }

        Note: Again, n is never set. But this time, you cannot derive it from the input. The input is a bit-stream--i.e. a string of bytes--but you cannot use strlen as the string will contain embedded nulls.
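
        For comparison, here is a rough Perl sketch (my own, not from the paper) of what a complete version of that byte-aligned delta scheme might look like, with previous actually being updated and the count stored up front so the decoder knows when to stop:

        use strict;
        use warnings;

        # Sketch: byte-aligned delta compression along the lines of the
        # paper's scheme. Each position is stored as the delta from the
        # previous one, 7 bits per byte, high bit set on every byte of a
        # delta except the last. The count is prepended as a 32-bit int.
        sub compress_posns {
            my @sorted = sort { $a <=> $b } @_;
            my $out      = pack( 'N', scalar @sorted );
            my $previous = 0;
            for my $value ( @sorted ) {
                my $delta = $value - $previous;
                $previous = $value;            # the step the paper's code omits
                while ( $delta >= 128 ) {
                    $out .= chr( ($delta & 127) | 128 );
                    $delta >>= 7;
                }
                $out .= chr $delta;
            }
            return $out;
        }

        sub decompress_posns {
            my( $in )    = @_;
            my $n        = unpack 'N', $in;
            my $inPos    = 4;
            my $previous = 0;
            my @out;
            for ( 1 .. $n ) {
                my $delta = 0;
                for ( my $shift = 0; ; $shift += 7 ) {
                    my $byte = ord( substr $in, $inPos++, 1 );
                    $delta  += ( $byte & 127 ) << $shift;
                    last if $byte < 128;
                }
                $previous += $delta;
                push @out, $previous;
            }
            return @out;
        }

        my @posns = ( 3, 17, 42, 1000, 100_000 );
        my $blob  = compress_posns( @posns );
        printf "%d positions packed into %d bytes: %s\n",
            scalar @posns, length $blob, join ',', decompress_posns( $blob );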

        So, one algorithm costs more than it saves, and the other, which seeks to address that, is incomplete and would probably still cost too much if implemented in Perl.

        The major costs of your schema are:

        • The volume of data you are transferring for each 'hit'.
        • The time taken to unpack that data.
        • The time taken to sub-select the small number of relevant sets of document offsets from the large number you are retrieving.

          Each time, you are having to unpack/decompress the document offsets for every document that contains a particular word, in order to gain access to just those that match your complete query.

          Using your numbers: 5,000 docs * 554 words/doc / 15,000 unique words = each word appears in (on average) 185 documents.

          That means you read, transfer and unpack/decompress all the offsets for 185 documents for each word. But probabilistically, for any given 2-word query, you only need the data for 7 documents (185 * 185 / 5,000 ≈ 7).

          So you've had to handle the data for 370 documents, but then discard it for 356 of them.

        If you went with the schema I presented, you would only read, transfer and unpack/decompress the data for those 7 documents that contain both words. That's 14/370*100 = less than 4% of the data to be manipulated (see the quick check below).
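
        A quick back-of-the-envelope check of those figures (my own sketch):

        use strict;
        use warnings;

        # Back-of-the-envelope check of the figures above.
        my $docs          = 5_000;
        my $words_per_doc = 554;
        my $unique_words  = 15_000;

        my $docs_per_word = $docs * $words_per_doc / $unique_words;      # ~185
        my $two_word_hits = $docs_per_word * ( $docs_per_word / $docs ); # ~7
        my $fraction      = 2 * $two_word_hits / ( 2 * $docs_per_word );

        printf "each word appears in ~%.0f documents\n",          $docs_per_word;
        printf "a 2-word query needs data for ~%.0f documents\n", $two_word_hits;
        printf "that is ~%.1f%% of the data retrieved\n",         100 * $fraction;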

        That saving will far exceed any additional compression you might get from using any of the methods you've mooted. Or any other compression mechanism.

        I don't think that breaking the schema down beyond the word-doc-offsets (plural) I suggested, to word-doc-offset (singular), will help much. You still need to transfer all the offsets for each word-doc pairing anyway, and the explosion in the size of the indexes will probably slow things.

        I think tachyon's point about variable length fields is a good one, except that it only affects the performance of indexes, if the variable length data is a part of the index. In the schema I outlined this is not the case. But, I'm not a DBA.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.