Isoparm has asked for the wisdom of the Perl Monks concerning the following question:

I have written a Perl script that reads in a large amount of text from a file and breaks it up into individual lines. I then check each line against some code. After I check a line, I write it to a "done" file so I know what has been completed, and then I splice that line out of the original array.

When the program starts, it also loads the "done" file into a hash. This is how I check whether a data line has already been examined: if it's in the hash, the program skips to the next line of data. (That way, if I stop the program or the computer crashes, I'm not starting from scratch.)
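A simplified sketch of the structure I'm describing (file names and the actual per-line check are just placeholders):

    use strict;
    use warnings;

    # Load the "done" file into a hash.
    my %done;
    open my $done_in, '<', 'done.txt' or die $!;
    while (my $line = <$done_in>) {
        chomp $line;
        $done{$line} = 1;
    }
    close $done_in;

    # Slurp the entire data file into an array (this is where the memory goes).
    open my $in, '<', 'data.txt' or die $!;
    chomp(my @lines = <$in>);
    close $in;

    open my $done_out, '>>', 'done.txt' or die $!;
    my $i = 0;
    while ($i < @lines) {
        my $line = $lines[$i];
        if ($done{$line}) {
            $i++;
            next;
        }

        # ... check the line against some code ...

        print {$done_out} "$line\n";   # record it as done
        splice @lines, $i, 1;          # remove it from the original array
    }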

The problem I'm finding is that with large data sets my Perl script quickly takes up a ton of memory.

Is there a way to do this in a manner that doesn't use so much memory?

Thanks

Replies are listed 'Best First'.
Re: Best way to manage memory when processing a large file?
by ikegami (Patriarch) on Sep 16, 2011 at 05:51 UTC

    You talk about loading the whole file into an array even though you process it line by line. This is obviously a waste of memory. Read the file a line at a time.

    So that leaves the size of the hash. You could start by using a more efficient data structure, such as Judy::HS. If that isn't enough, you could use a disk-based solution such as DB_File.
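    A minimal sketch of that combination (read line by line, keep the "done" set in a DB_File-tied hash on disk); the file names, and the choice of the line itself as the hash key, are assumptions about the original script:

        use strict;
        use warnings;
        use Fcntl;       # supplies O_RDWR and O_CREAT
        use DB_File;     # disk-backed hash

        # The "done" set lives on disk, so it no longer grows in RAM.
        tie my %done, 'DB_File', 'done.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie done.db: $!";

        open my $in, '<', 'data.txt' or die "Cannot open data.txt: $!";
        while (my $line = <$in>) {
            chomp $line;
            next if exists $done{$line};   # already handled on an earlier run

            # ... check the line here ...

            $done{$line} = 1;              # mark it as done
        }
        close $in;
        untie %done;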

Re: Best way to manage memory when processing a large file?
by zentara (Cardinal) on Sep 16, 2011 at 10:25 UTC
Re: Best way to manage memory when processing a large file?
by kcott (Archbishop) on Sep 16, 2011 at 05:37 UTC

    For processing a large file without having to worry about memory issues, I'd recommend Tie::File.

    What you're describing with regard to your done file doesn't sound like the best solution. With Tie::File, you can simply remove the records as they're processed.
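
    A minimal sketch of that approach, assuming the data lives in a file called data.txt; because processed records are deleted from the file itself, a restart just picks up with whatever is left:

        use strict;
        use warnings;
        use Tie::File;

        # Records are fetched from disk on demand rather than slurped into memory.
        tie my @lines, 'Tie::File', 'data.txt'
            or die "Cannot tie data.txt: $!";

        while (@lines) {
            my $line = $lines[0];    # Tie::File strips the newline for us

            # ... check the line here ...

            shift @lines;            # remove the finished record from the file
        }
        untie @lines;

    (Repeatedly deleting from the front of a large file isn't free, so this trades some speed for simplicity and restartability.)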

    -- Ken

Re: Best way to manage memory when processing a large file?
by Marshall (Canon) on Sep 16, 2011 at 23:50 UTC
    It would be very helpful if you could explain more about what you are doing (in terms of the application), how much processing it takes for each line of input to reach "done" status, and how many lines you are processing.

    From your description, when you finish processing all lines in the input file, there will be an equal number of lines in the output file, i.e. one input line results in one output line. True?

    I am going to answer your question, but I have to say that many times the right answer is to make the processing so darn fast that the ability to restart from some middle point doesn't matter at all (i.e. just starting over is fine).

    From the description, I see no reason to keep any significant amount of data in memory, much less a "lot".

    Read the input one line at a time: process the line, write it to the done file, then read the next line. The only data in memory is the current line.

    If the program crashes, recover like this: open the previous "done" file read-only and start a new done file for writing. Copy lines from the previous done file to the new one, always staying one line behind. The last line of the old done file may be corrupt, and rather than trying to figure out whether it is good, just throw it away. Then close the previous done file.

    Now we know how many "good" lines have been processed, so open the input file and just skip that many lines. BTW, Perl keeps track of the current line number in the $. variable, but a simple line counter is also a very cheap thing to implement.
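
    A rough sketch of that recovery step (file names are made up, and the actual per-line processing is elided):

        use strict;
        use warnings;

        # Copy the old done file to a new one, staying one line behind so the
        # possibly-corrupt last line is dropped, and count the good lines.
        open my $old_done, '<', 'done.txt'     or die "Cannot read done.txt: $!";
        open my $new_done, '>', 'done.new.txt' or die "Cannot write done.new.txt: $!";

        my $good = 0;
        my $prev;
        while (my $line = <$old_done>) {
            if (defined $prev) {
                print {$new_done} $prev;
                $good++;
            }
            $prev = $line;
        }
        close $old_done;    # $prev (the last, possibly corrupt line) is discarded

        # Resume: skip the lines that are already known to be done.
        open my $in, '<', 'data.txt' or die "Cannot read data.txt: $!";
        for (1 .. $good) {
            my $skip = <$in>;    # discard an already-processed line
        }

        while (my $line = <$in>) {
            chomp $line;
            # ... process $line and append it to $new_done ...
        }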

    What I have described is a very simple strategy, but the devil is in the details. What happens if there is a crash while the recovery from a previous crash is going on? That is not just hypothetical; lots of crashes are caused by systemic errors that your code has no control over. And there are all sorts of issues concerning cleaning up intermediate files.

    Anyway, one of my points is that adding error recovery will impact either the performance or the complexity of the application, and probably both.

    I would strongly recommend recoding the application so that it uses very little memory and runs as fast as possible. The error-recovery stuff will add 10x to the complexity of the code.

    Post some "lean and mean" Perl code with some detail of the app. The first focus should be to make it so fast that it doesn't have to be restarted in the middle. If that cannot be achieved, then let's talk about the restart code and to how to implement it efficiently and reliably.

Re: Best way to manage memory when processing a large file?
by Kc12349 (Monk) on Sep 16, 2011 at 15:43 UTC

    As to the efficiency of your data structure in memory, I'll leave that to others. Since I forget from time to time and get stung by leaving out the defined below, here is a basic while loop for reading line by line.

    The defined call ensures your loop does not end prematurely should a line evaluate to a false value (for example, a final line consisting of just "0" with no trailing newline).

    use autodie;

    open(my $fh, '<', $file);

    while ( defined (my $line = <$fh>) ) {
        chomp($line);
        # do stuff
    }