Craig720 has asked for the wisdom of the Perl Monks concerning the following question:

I search large text files line by line for keywords. If I get a hit, I read the entire file into a scalar variable.

These files contain several delimited sections. I then loop through the scalar variable, searching each delimited section for a keyword.

I use the global search if($string =~ m/regex/gix) to test if a regex matches a string (scalar variable). If I get a hit, I process that section.

My problem is, the text files are getting even larger. Not consistently, but often enough that I must have a procedure for handling them. If the file is too bloody large, I can no longer simply read it into a scalar variable.

I have to replace my wonderful global search with..., well, I don't know what, yet. Any ideas?

Re: Processing LARGE text files
by CountOrlok (Friar) on Mar 07, 2006 at 18:03 UTC
    Don't read line by line or slurp in the whole file. Read in one delimited section at a time. For example if your delimiter is "end of record\n", do this:
    local $/ = "end of record\n";
    while (<>) {
        # process a delimited section if it matches your criteria
    }
    -imran
      Thanks for the reply. Your suggestion does make sense.

      I think I tried 'chunking' once. Didn't work out too well. I experienced 'Sudden Flaming Death' -- my error message. I'll have to give your method another try in the morning when I'm fresh.

      Thanks for the tip. I'll see what happens.

Re: Processing LARGE text files
by zentara (Cardinal) on Mar 07, 2006 at 18:16 UTC
    Have you looked at Tie::File? It lets you work with the lines of a file as an array without pulling the whole file into memory.

      I tried Tie::File quite some time ago. Unless I was using it incorrectly, I found it altered the source file itself.

      If I was using it correctly, then I cannot use Tie::File. I need the original documents from which I extract the data to be untouched.

      When processed data is posted to our website, the unaltered source document goes up alongside it, with a link to that source placed next to the processed data.

        I don't think you were using it correctly, or you had some other bug in your code. Are you suggesting that Tie::File has a bug? Look at the section 'mode' in "perldoc Tie::File" to see how to keep your file read-only.
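
        In case it helps, here is a minimal sketch (mine, not from the original post, with a made-up filename) of tying a file read-only so Tie::File never writes to it:

            use strict;
            use warnings;
            use Fcntl 'O_RDONLY';
            use Tie::File;

            # mode => O_RDONLY keeps Tie::File from ever writing the file back
            tie my @lines, 'Tie::File', 'source.txt', mode => O_RDONLY
                or die "Cannot tie source.txt: $!";

            for my $line (@lines) {
                # read-only access; the source document stays untouched
            }

            untie @lines;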

        I'm not really a human, but I play one on earth. flash japh
Re: Processing LARGE text files
by QM (Parson) on Mar 07, 2006 at 18:20 UTC
    I second CountOrlok's suggestion.

    You'll actually find that storing the whole file in memory will cause disk swapping at some point, slowing down your process. If instead you read in a manageable chunk at a time, the process will run about as fast as possible, spending most of the time reading and matching, and no time writing virtual memory to disk.

    Also, I don't know what your file format is, but m//gix doesn't necessarily do the right thing across newlines.
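
    A quick illustration (my own toy example, not from the thread): without /s, "." will not cross a newline, so a pattern meant to span a multi-line section quietly fails to match:

        my $text = "begin\nfirst line\nsecond line\nend\n";

        # Without /s, . does not match "\n", so the block is not captured
        print "plain:   ", ($text =~ m/begin(.*?)end/  ? "match" : "no match"), "\n";

        # With /s, . also matches "\n", so the whole block is captured
        print "with /s: ", ($text =~ m/begin(.*?)end/s ? "match" : "no match"), "\n";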

    In general, programs have to be designed to work with the data. Only with experience can someone spew one of these off and expect it to work. (And with experience, if it doesn't work right away, both code and assumptions are checked for errors.)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Forgive me, but m/regex/gix was an oversimplification.

      To expand upon the logic and to be more accurate, I use:

      while($file =~ m/<DELIMITER>(.*?)<\/DELIMITER>/gs)

      to capture the text areas I need to search, and I use:

      if($searcharea =~ m/$regex/gm)

      to see if the selected areas of text contain any keywords.
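
      For what it is worth, a rough sketch of how that two-stage match could be combined with CountOrlok's record-separator idea might look like this (the filename and keyword pattern are made up for illustration):

          use strict;
          use warnings;

          my $file  = 'big_input.txt';              # placeholder filename
          my $regex = qr/keyword1|keyword2/i;       # placeholder keyword pattern

          open my $fh, '<', $file or die "Cannot open $file: $!";

          local $/ = "</DELIMITER>";                # read up to the end of each section
          while (my $chunk = <$fh>) {
              # pull out just the text between the delimiters
              my ($searcharea) = $chunk =~ m/<DELIMITER>(.*?)<\/DELIMITER>/s
                  or next;

              if ($searcharea =~ m/$regex/m) {
                  # a keyword was found in this section; process it here
              }
          }
          close $fh;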

        I notice you put in your example: <DELIMITER>(.*?)<\/DELIMITER>. Is this because you are working with very large XML files, or is it simply your own way of separating records?

        The reason I ask is that I have recently dealt with very large XML files and found XML::Twig to be very helpful. You can read in smaller chunks of XML data at a time and then process each chunk with the same ease as a tree-based parser such as XML::Simple. Once you are done processing a chunk, simply flush it (which prints the chunk) or purge it (which does not), freeing the memory.
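
        As a rough sketch (not a drop-in solution; the element name, filename, and keyword pattern below are all made up), the chunk-at-a-time flow with XML::Twig looks roughly like this:

            use strict;
            use warnings;
            use XML::Twig;

            my $keyword = qr/keyword1|keyword2/i;          # placeholder pattern

            my $twig = XML::Twig->new(
                twig_handlers => {
                    # 'record' stands in for whatever element marks one section
                    record => sub {
                        my ($t, $record) = @_;             # $t is the twig, $record the element
                        if ($record->text =~ $keyword) {
                            # process the matching record here
                        }
                        $t->purge;                         # free the memory used so far
                    },
                },
            );

            $twig->parsefile('large_file.xml');            # hypothetical filename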