in reply to Indexing two large text files

Also note that you can prepare the hash so that it contains, as the “value” associated with each key, the position of the start of the record in the file, and perhaps also the length of that record. Thus, even if the file itself is “large” (and by modern standards, 350MB really isn’t), you are only keeping track of the keys and the location of everything else, which you can retrieve as needed.
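
A minimal sketch of that idea, assuming one record per line with the key in the first tab-separated field (the file name and key layout here are only illustrative):

    # Build an index of key => [ byte offset, record length ] in a single pass.
    use strict;
    use warnings;

    my %index;
    open my $fh, '<', 'big_file.txt' or die "big_file.txt: $!";
    while ( 1 ) {
        my $offset = tell $fh;               # where this record starts
        my $line   = <$fh>;
        last unless defined $line;
        my ( $key ) = split /\t/, $line, 2;  # key = first field (assumption)
        $index{ $key } = [ $offset, length $line ];
    }

    # Later, retrieve a record on demand instead of holding it in memory.
    sub fetch_record {
        my ( $key ) = @_;
        my $where = $index{ $key } or return;
        seek $fh, $where->[0], 0;
        return scalar <$fh>;                 # re-read just that one record
    }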

If you intend to do this frequently, consider also using, say, an SQLite database file for all or part of the task. That gives you, for example, the easy option of indexing the data in several different ways at the same time, all of them ultimately using the “key + starting location + size of record” strategy. A single sequential pass through the file is enough to (re-)load these indexes, and then you can use SQL JOINs to start asking questions without doing any further “programming,” per se. The file being indexed, and the records therein, could be huge, while the index databases remain small and, of course, persistent.
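
A rough sketch of that idea (DBI with DBD::SQLite is assumed to be available; the file name, key layout, and table schema are only illustrative):

    # One sequential pass loads a small, persistent SQLite index of
    # key + starting offset + record length; the data file itself is never copied.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=file_index.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    $dbh->do( 'CREATE TABLE IF NOT EXISTS idx
                 ( k TEXT PRIMARY KEY, byte_pos INTEGER, byte_len INTEGER )' );
    my $ins = $dbh->prepare( 'INSERT OR REPLACE INTO idx VALUES ( ?, ?, ? )' );

    open my $fh, '<', 'big_file.txt' or die "big_file.txt: $!";
    while ( 1 ) {
        my $pos  = tell $fh;
        my $line = <$fh>;
        last unless defined $line;
        my ( $key ) = split /\t/, $line, 2;
        $ins->execute( $key, $pos, length $line );
    }
    $dbh->commit;

Further indexes over other fields can be added later without touching the data file itself.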

Naturally, it depends entirely on what you intend to do: which side of the technical trade-off you want to emphasize, and what price you can afford to pay to get it.

Re^2: Indexing two large text files
by BrowserUk (Patriarch) on Apr 09, 2012 at 18:50 UTC
    you can prepare the hash so that it contains, as the “value” that is associated with the key, the position of the start of the record in the file and maybe also the length of that record.

    The problem with that is, very little of a hash's size in memory is occupied by its values. And the internal overhead associated with each value is such that until a value exceeds something like 16 bytes, there is no saving in storing a shorter one.

    C:\test>p1
    $h{ $_ } = 'xxxxxxxx' for 1 .. 1e6;;
    print total_size( \%h );;
    112277576

    C:\test>p1
    $h{ $_ } = 1 for 1 .. 1e6;;
    print total_size( \%h );;
    72277576

    So unless the records are particularly long, there is little saving to be made by storing the file offset rather than the record itself.

    I'm using a 64-bit perl, which means that integers occupy 8 bytes, which flatters my point somewhat; but even on a 32-bit perl, the savings are far from what you might at first expect. For example, the same 1-million-key hash with no values at all still occupies substantially the same space as one with values:

    C:\test>p1
    undef $h{ $_ } for 1 .. 1e6;;
    print total_size( \%h );;
    72277576

    In summary: You can gain some space by storing record positions instead of the records themselves, but it doesn't buy you as much as you at first might think it would.


      Granted... particularly in this case, I agree completely. A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

      The notion of using an SQLite file, literally as a persistent index covering as many keys as may be necessary, is actually the one that I tend to come back to, over and over again, when dealing with files like these. I need to know where to find, via direct access, whatever it is I am looking for. One pass through the file locates everything. The “interesting stuff” now gets done with JOINs, often in a command-line ad hoc fashion. Not in the “gigantic” case you recently spoke of, but maybe a very useful idea in this one.
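
      For instance, if each of the two files has been indexed into its own table along the lines sketched earlier (the table and column names here are made up for illustration):

          # Ad hoc query: which keys appear in both files, and where do the records live?
          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=file_index.db', '', '',
                                  { RaiseError => 1 } );
          my $sth = $dbh->prepare( q{
              SELECT a.k, a.byte_pos, a.byte_len, b.byte_pos, b.byte_len
              FROM   idx_file1 AS a
              JOIN   idx_file2 AS b ON a.k = b.k
          } );
          $sth->execute;
          while ( my @row = $sth->fetchrow_array ) {
              print join( "\t", @row ), "\n";   # seek into either file as needed
          }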

      SQL tricks not kosher? And (a mere...) 350 megs? “If you’ve got the RAM, then by all means use it and be done.” Perl won’t blink an eye.

        A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

        Slurped into a scalar, okay. But for the OP's purpose he would need to build a hash from it, and that would require 52.5GB of RAM.

        Not impossible, for sure, but it would (still, currently) take a machine that is a cut (or two) above the average commodity box, many of whose motherboards are still limited to 16 or 32GB.
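
        A rough way to check that kind of estimate against one's own data, using Devel::Size's total_size() as in the measurements above (the file name, record format, and sample size here are illustrative assumptions):

            # Load a sample of records into a hash, measure it, and scale up.
            use strict;
            use warnings;
            use Devel::Size qw( total_size );

            my $sample = 100_000;
            my %h;
            open my $fh, '<', 'big_file.txt' or die "big_file.txt: $!";
            while ( my $line = <$fh> ) {
                last if $. > $sample;
                chomp $line;
                my ( $key, $rest ) = split /\t/, $line, 2;
                $h{ $key } = $rest;
            }
            die "no records sampled\n" unless keys %h;
            printf "~%.0f bytes per record in the hash\n", total_size( \%h ) / keys %h;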


Unfortunately I cannot upgrade the RAM in my working environment; it is not my own box. I think they have even set a memory cap: any program that takes more than 1GB will be killed automatically. :(