You can save about 60% of the space by storing this data in binary. The following assumes that your each of your documents can be upto 4GB (hence the 'N' pack format). (This is very tersely coded just for a quick test):

#! perl -slw use strict; use List::Util qw[ sum ]; use Math::Random::MT qw[ rand ]; use Data::Dump qw[ pp ]; $Data::Dump::MAX_WIDTH = 1000; use constant NO_OF_DOCS => 10000; use constant DOC_PER_WORD => NO_OF_DOCS / 2; use constant OCCURANCES_PER_DOC => 20; ## Build some test data my @words = qw[ the quick brown fox jumps over a lazy dog twice ]; my %index = map{ $_ => {} } @words; for my $word ( keys %index ) { for my $doc ( map{ int rand NO_OF_DOCS } 1 .. DOC_PER_WORD ) { my $docName = sprintf "Doc%05d", $doc; $index{ $word }{ $docName } = [ map{ int rand 2**32 } 1 .. OCCURANCES_PER_DOC ] } } ##pp \%index; ## Building the posting lists in ASCII my @postings = map { my $wordHash = $_; join ' ', map{ "$_ " . join ':', @{ $wordHash->{ $_ } }; } keys %{ $wordHash }; } values %index; ## pp \@postings; printf "ASCII uses %d bytes\n", sum map length, @postings; ## Doing them in binary my @binPostings = map { my $wordHash = $_; pack '(N/A*)*', map { pack 'C/A* A*', $_, pack 'N*', @{ $wordHash->{ $_ } }; } keys %{ $wordHash }; } values %index; printf "Binary uses %d bytes\n", sum map length, @binPostings; #pp \@binPostings; <>; ## Unpack the packed binary my $wordIndex = 0; for my $wordData ( @binPostings ) { print $words[ $wordIndex ++ ]; for my $docData ( unpack '(N/A*)*', $wordData ) { my( $doc, $posns ) = unpack 'C/A* A*', $docData; print "\t$doc: ", join ', ', unpack 'N*', $posns; } } __END__ ## Output for 10 words appearing 20 times in each of ## half of 10,000 4GB documents at random positions. C:\test>678848 ASCII uses 8,801,314 bytes Binary uses 3,656,853 bytes

But don't stop reading yet.

If you are storing these records in an RDBMS, you are wasting the potential of that tool. Doing it this way (in either ASCII or binary), to find all the documents that contain 2 words, you are having to retrieve and unpack all the positions for both words in every document in which they appear, and then doing the union of the document sets in Perl.

RDBMSs exist to do set algebra. Doing it in Perl and having to retrieve and unpack all that data to do it is silly. You'd be far better off changing your schema to allow the RDBMS to give you only those offsets for those documents that contain the words or words you require. But you'll need a DBA, or at least someone who SQL skills are a lot more current than my own to set you straight on that.

A schema something like:

TABLE words word-id MEDIUMINT primary key word VARCHAR TABLE docs doc-id INTEGER primary key docname VARCHAR TABLE word-doc word-id MEDIUMINT REFERENCES words( word-id) } primary key doc-id INTEGER REFERENCES docs( doc-id ) } posns MEDIUMBLOB

Should be far more compact and allow you to issue queries that find (for example) all the documents contain two (or more) given words and only return the offsets for those words in those documents. Far more of the work done by the RDBMS and far less data to be read, transmitted and unpacked.

You could then go a step further and investingate bitstring fields and boolean algebra.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^2: Compress positive integers by BrowserUk
in thread Compress positive integers by MimisIVI

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.