in reply to Re^2: Indexing two large text files
in thread Indexing two large text files

Granted... particularly in this case, I agree completely. A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

The notion of using an SQLite file, literally as a persistent index covering as many keys as may be necessary, is actually the one that I tend to come back to, over and over again, when dealing with files like these. I need to know where to find, via direct access, whatever it is I am looking for. One pass through the file locates everything. The “interesting stuff” now gets done with JOINs, often in a command-line ad hoc fashion. Not in the “gigantic” case you recently spoke of, but maybe a very useful idea in this one.
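
A minimal sketch of that idea in Perl, using DBI and DBD::SQLite. The file name (file2) and the '*'-delimited key field are assumptions borrowed from the one-liners later in this thread; the point is simply that one pass stores each key's byte offset in a persistent SQLite file, after which any record can be fetched directly with seek.

use strict;
use warnings;
use DBI;

# Open (or create) the persistent index file.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=file2.idx', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );
$dbh->do( 'CREATE TABLE IF NOT EXISTS idx ( k TEXT PRIMARY KEY, pos INTEGER )' );
my $ins = $dbh->prepare( 'INSERT OR REPLACE INTO idx ( k, pos ) VALUES ( ?, ? )' );

# One pass: record the byte offset of every record, keyed on its first field.
open my $fh, '<', 'file2' or die $!;
my $pos = tell $fh;
while ( my $line = <$fh> ) {
    my( $key ) = split /\*/, $line, 2;
    $ins->execute( $key, $pos );
    $pos = tell $fh;
}
$dbh->commit;

# Later: direct access to any record by key ('some_key' is a placeholder).
my( $where ) = $dbh->selectrow_array(
    'SELECT pos FROM idx WHERE k = ?', undef, 'some_key' );
if( defined $where ) {
    seek $fh, $where, 0;
    print scalar <$fh>;
}

The ad hoc JOINs mentioned above can then be run against this index (or against two of them, one per file) straight from the sqlite3 command-line shell.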

SQL tricks not kosher for you? And (a mere...) 350 megs? “If you’ve got the RAM, then by all means use it and be done.” Perl won’t blink an eye.
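
And for the all-in-RAM route, a minimal sketch under the same assumptions (the file names file1/file2 and the '*'-delimited key are guesses based on the one-liners later in the thread): slurp the second file into a hash keyed on its first field, then stream the first file and look each key up directly.

use strict;
use warnings;

# Load the lookup file into memory once.
my %index;
open my $fh2, '<', 'file2' or die "file2: $!";
while ( <$fh2> ) {
    chomp;
    my( $key, $value ) = split /\*/, $_, 2;
    $index{ $key } = $value;
}
close $fh2;

# Stream the other file; every lookup is a single hash probe.
open my $fh1, '<', 'file1' or die "file1: $!";
while ( <$fh1> ) {
    chomp;
    my( $key ) = split /\*/, $_, 2;
    print "$key => $index{ $key }\n" if exists $index{ $key };
}
close $fh1;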

Re^4: Indexing two large text files
by BrowserUk (Patriarch) on Apr 09, 2012 at 22:14 UTC
    A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

    Slurped into a scalar, okay. But for the OP's purpose he would need to build a hash from it, and that would require 52.5GB of RAM.

    Not impossible for sure, but it would (still, currently) take a machine that is a cut (or two) above the average commodity box, many of whose motherboards are still limited to 16 or 32GB.



      I feel like I'm missing something. Why would it take 52GB of memory to build a hash from 350MB of data? Does the hash overhead really take 150 times as much space as the data itself? I just wrote a little script that takes one of my httpd logs, splits each line on the first double-quote character ("), and uses those two sections as the key and value of a hash. This log file is 27MB, and Devel::Size->total_size says the resulting hash is 38MB. That's 40% overhead, which seems much more reasonable, and would mean the original poster's 350MB might take up 500MB as a hash, still well within his limits.
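
      A hypothetical reconstruction of that little script (the original isn't shown, and the log file name here is made up): split each line on the first double-quote, load the two halves into a hash as key and value, and report the hash's footprint with Devel::Size.

      use strict;
      use warnings;
      use Devel::Size qw( total_size );

      my %h;
      open my $log, '<', 'access.log' or die $!;
      while ( my $line = <$log> ) {
          chomp $line;
          my( $key, $value ) = split /"/, $line, 2;
          $h{ $key } = $value;
      }
      close $log;

      # Compare the hash's in-memory size with the file's size on disk.
      printf "hash total_size: %d bytes\n", total_size( \%h );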

      Aaron B.
      My Woefully Neglected Blog, where I occasionally mention Perl.

        I did this:

        C:\test>p1
        $h{ $_ } = 'x'x50 for 1 .. 10e6;;

        print total_size( \%h );;
        1583106697

        print 1583106697 * 35;;
        55408734395

        Which, looking back at the OP, means I calculated the size of a 350-million-record file instead of a 350MB file. My mistake.

        A more appropriate figure for the OP's 350MB file is 3.8GB:

        C:\test>dir file2x
        10/04/2012  17:27       369,499,228 file2x

        C:\test>perl -nle"($k,$v)=split '\*'; $h{$k}=$v }{ print 'Check mem'; <>" file2x
        Check mem
        3.8GB

        I did try to use the latest Devel::Size to do the measurement, but it pushed the memory usage over 8GB before crashing. Looks like it is time for a new release of my unauthorised version.



Re^4: Indexing two large text files
by never_more (Initiate) on Apr 10, 2012 at 11:57 UTC
    Unfortunately I cannot upgrade the RAM in my working environment; it is not my own box. I think they even set a memory cap: any program that takes more than 1GB will be killed automatically. :(