Re: Processing LARGE text files

I second CountOrlok's suggestion.

You'll actually find that storing the whole file in memory will cause disk swapping at some point, slowing down your process. If instead you read in a manageable chunk at a time, the process will run about as fast as possible, spending most of the time reading and matching, and no time writing virtual memory to disk.
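
To make the "manageable chunk" idea concrete, here is a minimal sketch that holds only one line in memory at a time. The helper name `count_matching_lines` and the assumption that a line is a sensible chunk for your data are mine, not something from the original post; adjust the chunk size to whatever your format calls for.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of $path that match $regex,
# reading one line at a time rather than slurping the whole file.
sub count_matching_lines {
    my ($path, $regex) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {    # only the current line is held in memory
        $count++ if $line =~ $regex;
    }
    close $fh;
    return $count;
}
```

Called as `count_matching_lines('big.txt', qr/keyword/i)`, memory use stays roughly constant no matter how large the file is.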

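Also, I don't know what your file format is, but m//gix doesn't necessarily do the right thing across newlines: without /s, "." will not match a newline, and without /m, ^ and $ anchor only at the ends of the whole string rather than at each line.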
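
A small sketch of how the modifiers change matching across newlines. The sample string is made up for illustration.

```perl
use strict;
use warnings;

my $text = "start\nfoo bar\nend";

# Without /s, "." stops at a newline, so this cannot span lines.
my $no_s   = $text =~ /start.*end/  ? 1 : 0;
# With /s, "." also matches newlines.
my $with_s = $text =~ /start.*end/s ? 1 : 0;

# Without /m, ^ anchors only at the start of the whole string.
my $no_m   = $text =~ /^foo/  ? 1 : 0;
# With /m, ^ also anchors just after each embedded newline.
my $with_m = $text =~ /^foo/m ? 1 : 0;

print "no /s: $no_s, /s: $with_s, no /m: $no_m, /m: $with_m\n";
# prints "no /s: 0, /s: 1, no /m: 0, /m: 1"
```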

In general, programs have to be designed to work with the data. Only with experience can someone spew one of these off and expect it to work. (And with experience, if it doesn't work right away, both code and assumptions are checked for errors.)

-QM
--
Quantum Mechanics: The dreams stuff is made of

Re^2: Processing LARGE text files by Craig720 (Initiate) on Mar 07, 2006 at 19:46 UTC
Forgive me, but m/regex/gix was an oversimplification. To expand upon the logic and to be more accurate, I use `while($file =~ m/<DELIMITER>(.*?)<\/DELIMITER>/gs)` to capture the text areas I need to search, and `if($searcharea =~ m/$regex/gm)` to see if the selected areas of text contain any keywords.
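One way to combine those two patterns with chunked reading is to set Perl's input record separator `$/` to the closing delimiter, so each read returns at most one delimited block. This is only a sketch: the helper name `matching_blocks` is hypothetical, and it assumes the blocks are not nested and that the closing delimiter always appears literally as `</DELIMITER>`.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text areas between <DELIMITER> and
# </DELIMITER> that match $regex, reading one block at a time.
sub matching_blocks {
    my ($path, $regex) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/ = '</DELIMITER>';    # each read ends at a closing delimiter
    my @hits;
    while (my $record = <$fh>) {
        # Skip trailing text that does not contain a complete block.
        next unless $record =~ m/<DELIMITER>(.*?)<\/DELIMITER>/s;
        my $searcharea = $1;
        push @hits, $searcharea if $searcharea =~ $regex;
    }
    close $fh;
    return @hits;
}
```

For example, `matching_blocks('big.txt', qr/^keyword/m)` would return only the blocks in which some line starts with "keyword", without ever loading the whole file into `$file`.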
Re^3: Processing LARGE text files by thedoe (Monk) on Mar 07, 2006 at 21:19 UTC
I notice you put in your example: `<DELIMITER>(.*?)<\/DELIMITER>`. Is this because you are working with very large XML files? Or is this simply your way of separation? The reason I ask is because I have recently dealt with very large XML files, and found XML::Twig to be very helpful. You can read in smaller chunks of XML data at a time. You can then process each chunk with the same ease as a tree-based parser, such as XML::Simple. Once you are done processing that chunk, simply either flush (which prints the chunk) or purge (does not print) the data, freeing the memory.
Re^4: Processing LARGE text files by Craig720 (Initiate) on Mar 08, 2006 at 14:55 UTC
The delimiters are words in angle brackets such as <BOUNDARY> and </BOUNDARY>. Can the XML modules you mentioned be rigged to operate on very large text files containing non-standard XML? My experience with XML is minimal.