in reply to Re: Indexing two large text files
in thread Indexing two large text files

You can prepare the hash so that it contains, as the “value” associated with the key, the position of the start of the record in the file, and maybe also the length of that record.
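For concreteness, a minimal sketch of that offset-index idea (the filename and the tab-separated key field are invented for illustration; it assumes one newline-terminated record per line):

use strict;
use warnings;

# Store key => [ offset, length ] instead of key => record.
my %index;
open my $in, '<', 'records.txt' or die "open: $!";
while ( 1 ) {
    my $pos  = tell $in;                  # byte offset of this record's start
    my $line = <$in>;
    last unless defined $line;
    my ( $key ) = split /\t/, $line, 2;   # assume the key is the first field
    $index{ $key } = [ $pos, length $line ];
}

# Retrieve a record later by seeking to its stored offset.
sub fetch_record {
    my ( $key ) = @_;
    my ( $pos, $len ) = @{ $index{ $key } };
    seek $in, $pos, 0 or die "seek: $!";
    read $in, my $record, $len;
    return $record;
}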

The problem with that is that very little of the size of a hash in memory is occupied by its values. And the internal overhead associated with each value is such that, until the value grows beyond something like 16 bytes, there is no saving in storing a shorter value.

C:\test>p1
$h{ $_ } = 'xxxxxxxx' for 1 .. 1e6;;
print total_size( \%h );;
112277576

C:\test>p1
$h{ $_ } = 1 for 1 .. 1e6;;
print total_size( \%h );;
72277576
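For anyone without that REPL to hand, a roughly equivalent standalone measurement (exact numbers will vary with the perl version and build):

use strict;
use warnings;
use Devel::Size qw( total_size );

my ( %long, %short );
$long{ $_ }  = 'xxxxxxxx' for 1 .. 1e6;
$short{ $_ } = 1          for 1 .. 1e6;

printf "8-char string values: %d bytes\n", total_size( \%long );
printf "integer values:       %d bytes\n", total_size( \%short );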

So unless the records are particularly long, there is little saving to be made by storing the file offset rather than the record itself.

I'm using a 64-bit perl, which means that integers occupy 8 bytes, which flatters my point somewhat; but even on a 32-bit perl, the savings are far from what you might at first expect. For example, the same 1-million-key hash with "no values" at all still occupies substantially the same space as one with values:

C:\test>p1
undef $h{ $_ } for 1 .. 1e6;;
print total_size( \%h );;
72277576

In summary: you can gain some space by storing record positions instead of the records themselves, but it doesn't buy you as much as you might at first think.



Re^3: Indexing two large text files
by sundialsvc4 (Abbot) on Apr 09, 2012 at 22:02 UTC

    Granted... particularly in this case, I agree completely. A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

    The notion of using an SQLite file, literally as a persistent index covering as many keys as may be necessary, is actually the one that I tend to come back to, over and over again, when dealing with files like these. I need to know where to find, via direct access, whatever it is I am looking for. One pass through the file locates everything. The “interesting stuff” now gets done with JOINs, often in a command-line ad hoc fashion. Not in the “gigantic” case you recently spoke of, but maybe a very useful idea in this one.

    Not kosher to SQL tricks? And, (a mere...) 350 megs? “If you’ve got the RAM, then by all means use it and be done.” Perl won’t blink an eye.
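    A rough sketch of that SQLite-index approach (DBD::SQLite is assumed to be installed; the file, table, and column names are invented for illustration):

    use strict;
    use warnings;
    use DBI;

    # Index both files into one SQLite database, then let SQL do the matching.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=index.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    for my $spec ( [ idx_a => 'file_a.txt' ], [ idx_b => 'file_b.txt' ] ) {
        my ( $table, $file ) = @$spec;
        $dbh->do( "CREATE TABLE IF NOT EXISTS $table ( k TEXT PRIMARY KEY, pos INTEGER, len INTEGER )" );
        my $ins = $dbh->prepare( "INSERT OR REPLACE INTO $table VALUES ( ?, ?, ? )" );

        open my $fh, '<', $file or die "open $file: $!";
        while ( 1 ) {
            my $pos  = tell $fh;
            my $line = <$fh>;
            last unless defined $line;
            my ( $key ) = split /\t/, $line, 2;   # assume the key is the first field
            $ins->execute( $key, $pos, length $line );
        }
    }
    $dbh->commit;

    # The "interesting stuff" as a JOIN: keys that appear in both files.
    my $common = $dbh->selectall_arrayref(
        'SELECT a.k, a.pos, b.pos FROM idx_a a JOIN idx_b b ON a.k = b.k' );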

      A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

      Slurped into a scalar, okay. But for the OP's purpose he would need to build a hash from it, and that would require 52.5GB of RAM.

      Not impossible for sure, but it would (still, currently) take a machine that is a cut (or two) above the average commodity box, many of whose motherboards are still limited to 16 or 32GB.



        I feel like I'm missing something. Why would it take 52GB of memory to build a hash from 350MB of data? Does the hash overhead really take 150 times as much space as the data itself? I just wrote a little script that takes one of my httpd logs, splits each line on the first " character, and uses those two sections as the key and value of a hash. This log file is 27MB, and Devel::Size->total_size says the resulting hash is 38MB. That's 40% overhead, which seems much more reasonable, and would mean the original poster's 350MB might take up 500MB as a hash, still well within his limits.
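        Roughly the kind of measurement described, as a sketch (the log filename is invented; it splits each line on the first double-quote and sizes the resulting hash with Devel::Size):

        use strict;
        use warnings;
        use Devel::Size qw( total_size );

        my %h;
        open my $fh, '<', 'access.log' or die "open: $!";   # log filename invented
        while ( my $line = <$fh> ) {
            chomp $line;
            my ( $key, $value ) = split /"/, $line, 2;      # split on the first "
            $h{ $key } = $value;
        }
        printf "total_size of hash: %d bytes\n", total_size( \%h );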

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.

      Unfortunately I cannot upgrade the RAM in my working environment. It is not my own box. I think they even set a memory cap: any program that takes more than 1GB will be killed automatically. :(