Craig720 has asked for the wisdom of the Perl Monks concerning the following question:

I search large text files line by line for keywords. If I get a hit, I read the entire file into a scalar variable.

These files contain several delimited sections. I then loop through the scalar variable, searching each delimited section for a keyword.

I use the global search if($string =~ m/regex/gix) to test if a regex matches a string (scalar variable). If I get a hit, I process that section.

My problem is, the text files are getting even larger. Not consistently, but often enough that I must have a procedure for handling them. If the file is too bloody large, I can no longer simply read it into a scalar variable.

I have to replace my wonderful global search with..., well, I don't know what, yet. Any ideas?

Re: Processing LARGE text files
by CountOrlok (Friar) on Mar 07, 2006 at 18:03 UTC
    Don't read line by line or slurp in the whole file. Read in one delimited section at a time. For example if your delimiter is "end of record\n", do this:
    local $/ = "end of record\n";
    while (<>) {
        # process a delimited section if it matches your criteria
    }
    -imran
      Thanks for the reply. Your suggestion does make sense.

      I think I tried 'chunking' once. Didn't work out too well. I experienced 'Sudden Flaming Death' -- my error message. I'll have to give your method another try in the morning when I'm fresh.

      Thanks for the tip. I'll see what happens.

Re: Processing LARGE text files
by zentara (Cardinal) on Mar 07, 2006 at 18:16 UTC
    Have you looked at Tie::File? It lets you work with the lines of a file as an array without pulling the whole file into memory.

      I tried Tie::File quite some time ago. Unless I was using it incorrectly, I found it altered the source file itself.

      If I was using it correctly, then I cannot use Tie::File. I need the original documents from which I extract the data to be untouched.

      When processed data is posted to our website, the unaltered source document goes up alongside it, with a link to that source placed next to the processed data.

        I don't think you were using it correctly, or you had some other bug in your code. Are you suggesting that Tie::File has a bug? Look at the section 'mode' in "perldoc Tie::File" to see how to keep your file read-only.
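
        In case it helps, here is a minimal sketch (mine, not from the original post, with a made-up filename) of tying a file read-only so Tie::File never writes to it:

            use strict;
            use warnings;
            use Fcntl 'O_RDONLY';
            use Tie::File;

            # mode => O_RDONLY keeps Tie::File from ever writing the file back
            tie my @lines, 'Tie::File', 'source.txt', mode => O_RDONLY
                or die "Cannot tie source.txt: $!";

            for my $line (@lines) {
                # read-only access; the source document stays untouched
            }

            untie @lines;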

        I'm not really a human, but I play one on earth. flash japh
Re: Processing LARGE text files
by QM (Parson) on Mar 07, 2006 at 18:20 UTC
    I second CountOrlok's suggestion.

    You'll actually find that storing the whole file in memory will cause disk swapping at some point, slowing down your process. If instead you read in a manageable chunk at a time, the process will run about as fast as possible, spending most of the time reading and matching, and no time writing virtual memory to disk.

    Also, I don't know what your file format is, but m//gix doesn't necessarily do the right thing across newlines.
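
    A quick illustration (my own toy example, not from the thread): without /s, "." will not cross a newline, so a pattern meant to span a multi-line section quietly fails to match:

        my $text = "begin\nfirst line\nsecond line\nend\n";

        # Without /s, . does not match "\n", so the block is not captured
        print "plain:   ", ($text =~ m/begin(.*?)end/  ? "match" : "no match"), "\n";

        # With /s, . also matches "\n", so the whole block is captured
        print "with /s: ", ($text =~ m/begin(.*?)end/s ? "match" : "no match"), "\n";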

    In general, programs have to be designed to work with the data. Only with experience can someone spew one of these off and expect it to work. (And with experience, if it doesn't work right away, both code and assumptions are checked for errors.)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Forgive me, but m/regex/gix was an oversimplification.

      To expand upon the logic and to be more accurate, I use:

      while($file =~ m/<DELIMITER>(.*?)<\/DELIMITER>/gs)

      to capture the text areas I need to search, and I use:

      if($searcharea =~ m/$regex/gm)

      to see if the selected areas of text contain any keywords.
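
      For what it is worth, a rough sketch of how that two-stage match could be combined with CountOrlok's record-separator idea might look like this (the filename and keyword pattern are made up for illustration):

          use strict;
          use warnings;

          my $file  = 'big_input.txt';              # placeholder filename
          my $regex = qr/keyword1|keyword2/i;       # placeholder keyword pattern

          open my $fh, '<', $file or die "Cannot open $file: $!";

          local $/ = "</DELIMITER>";                # read up to the end of each section
          while (my $chunk = <$fh>) {
              # pull out just the text between the delimiters
              my ($searcharea) = $chunk =~ m/<DELIMITER>(.*?)<\/DELIMITER>/s
                  or next;

              if ($searcharea =~ m/$regex/m) {
                  # a keyword was found in this section; process it here
              }
          }
          close $fh;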

        I notice you put in your example: <DELIMITER>(.*?)<\/DELIMITER>. Is this because you are working with very large XML files, or is it simply your own way of separating records?

        The reason I ask is that I have recently dealt with very large XML files and found XML::Twig to be very helpful. You can read in smaller chunks of XML data at a time and then process each chunk with the same ease as a tree-based parser such as XML::Simple. Once you are done processing a chunk, simply flush it (which prints the chunk) or purge it (which does not), freeing the memory.
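
        As a rough sketch (not a drop-in solution; the element name, filename, and keyword pattern below are all made up), the chunk-at-a-time flow with XML::Twig looks roughly like this:

            use strict;
            use warnings;
            use XML::Twig;

            my $keyword = qr/keyword1|keyword2/i;          # placeholder pattern

            my $twig = XML::Twig->new(
                twig_handlers => {
                    # 'record' stands in for whatever element marks one section
                    record => sub {
                        my ($t, $record) = @_;             # $t is the twig, $record the element
                        if ($record->text =~ $keyword) {
                            # process the matching record here
                        }
                        $t->purge;                         # free the memory used so far
                    },
                },
            );

            $twig->parsefile('large_file.xml');            # hypothetical filename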