Isoparm has asked for the wisdom of the Perl Monks concerning the following question:

I have written a Perl script that reads in a large amount of text from a file and breaks it up into individual lines. I then check each line against some code. After I check a line, I write it to a "done" file so I know what has been completed, and then I splice that line out of the original array.

When the program starts, it also loads the "done" file into a hash. This is how I check whether a data line has already been examined: if it's in the hash, the program skips to the next line of data. (That way, if I stop the program or the computer crashes, I'm not starting from scratch.)
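A simplified sketch of the structure I'm describing (file names and the actual per-line check are just placeholders):

    use strict;
    use warnings;

    # Load the "done" file into a hash.
    my %done;
    open my $done_in, '<', 'done.txt' or die $!;
    while (my $line = <$done_in>) {
        chomp $line;
        $done{$line} = 1;
    }
    close $done_in;

    # Slurp the entire data file into an array (this is where the memory goes).
    open my $in, '<', 'data.txt' or die $!;
    chomp(my @lines = <$in>);
    close $in;

    open my $done_out, '>>', 'done.txt' or die $!;
    my $i = 0;
    while ($i < @lines) {
        my $line = $lines[$i];
        if ($done{$line}) {
            $i++;
            next;
        }

        # ... check the line against some code ...

        print {$done_out} "$line\n";   # record it as done
        splice @lines, $i, 1;          # remove it from the original array
    }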

The problem I'm finding is that with large data sets my Perl script quickly takes up a ton of memory.

Is there a way to do this in a manner that doesn't use so much memory?

Thanks

Replies are listed 'Best First'.
Re: Best way to manage memory when processing a large file?
by ikegami (Patriarch) on Sep 16, 2011 at 05:51 UTC

    You talk about loading the whole file into an array even though you process it line by line. This is obviously a waste of memory. Read the file a line at a time.

    So that leaves the size of the hash. You could start by using a more efficient data structure, such as Judy::HS. If that isn't enough, you could use a disk-based solution such as DB_File.
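    A minimal sketch of that combination (read line by line, keep the "done" set in a DB_File-tied hash on disk); the file names, and the choice of the line itself as the hash key, are assumptions about the original script:

        use strict;
        use warnings;
        use Fcntl;       # supplies O_RDWR and O_CREAT
        use DB_File;     # disk-backed hash

        # The "done" set lives on disk, so it no longer grows in RAM.
        tie my %done, 'DB_File', 'done.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie done.db: $!";

        open my $in, '<', 'data.txt' or die "Cannot open data.txt: $!";
        while (my $line = <$in>) {
            chomp $line;
            next if exists $done{$line};   # already handled on an earlier run

            # ... check the line here ...

            $done{$line} = 1;              # mark it as done
        }
        close $in;
        untie %done;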

Re: Best way to manage memory when processing a large file?
by zentara (Cardinal) on Sep 16, 2011 at 10:25 UTC
Re: Best way to manage memory when processing a large file?
by kcott (Archbishop) on Sep 16, 2011 at 05:37 UTC

    For processing a large file without having to worry about memory issues, I'd recommend Tie::File.

    What you're describing with regard to your done file doesn't sound like the best solution. With Tie::File, you can simply remove the records as they're processed.
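
    A minimal sketch of that approach, assuming the data lives in a file called data.txt; because processed records are deleted from the file itself, a restart just picks up with whatever is left:

        use strict;
        use warnings;
        use Tie::File;

        # Records are fetched from disk on demand rather than slurped into memory.
        tie my @lines, 'Tie::File', 'data.txt'
            or die "Cannot tie data.txt: $!";

        while (@lines) {
            my $line = $lines[0];    # Tie::File strips the newline for us

            # ... check the line here ...

            shift @lines;            # remove the finished record from the file
        }
        untie @lines;

    (Repeatedly deleting from the front of a large file isn't free, so this trades some speed for simplicity and restartability.)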

    -- Ken

Re: Best way to manage memory when processing a large file?
by Marshall (Canon) on Sep 16, 2011 at 23:50 UTC
    It would be very helpful if you could explain more about what you are doing (in terms of the application), how much processing it takes for each line of input to reach "done" status, and how many lines you are processing.

    From your description, when you finish processing all lines in the input file, there will be an equal number of lines in the output file, i.e. one input line results in one output line. True?

    I am going to answer your question, but I have to say that many times the right answer is to make the processing so darn fast that the ability to restart from some middle point doesn't matter at all (i.e. just starting over is fine).

    From the description, I see no reason to keep any significant amount of data in memory, much less a "lot".

    Read the input one line at a time: process the line, write it to the done file, then read the next line. The only data in memory is the current line.

    If the program crashes, recover like this: open the previous "done" file read-only and start a new done file for writing. Copy lines from the previous done file to the new one, always staying one line behind. The last line of the old done file may be corrupt, and rather than trying to figure out whether it is good, just throw it away. Then close the previous done file.

    Now we know how many "good" lines have been processed, so open the input file and just skip that many lines. BTW, Perl keeps track of the current line number in the $. variable, but a simple line counter is also a very cheap thing to implement.
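
    A rough sketch of that recovery step (file names are made up, and the actual per-line processing is elided):

        use strict;
        use warnings;

        # Copy the old done file to a new one, staying one line behind so the
        # possibly-corrupt last line is dropped, and count the good lines.
        open my $old_done, '<', 'done.txt'     or die "Cannot read done.txt: $!";
        open my $new_done, '>', 'done.new.txt' or die "Cannot write done.new.txt: $!";

        my $good = 0;
        my $prev;
        while (my $line = <$old_done>) {
            if (defined $prev) {
                print {$new_done} $prev;
                $good++;
            }
            $prev = $line;
        }
        close $old_done;    # $prev (the last, possibly corrupt line) is discarded

        # Resume: skip the lines that are already known to be done.
        open my $in, '<', 'data.txt' or die "Cannot read data.txt: $!";
        for (1 .. $good) {
            my $skip = <$in>;    # discard an already-processed line
        }

        while (my $line = <$in>) {
            chomp $line;
            # ... process $line and append it to $new_done ...
        }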

    What I have described is a very simple strategy, but the devil is in the details. What happens if there is a crash while the recovery from a previous crash is going on? That is not just hypothetical; lots of crashes are caused by systemic errors that your code has no control over. And there are all sorts of issues concerning cleaning up intermediate files.

    Anyway, one of my points is that adding error recovery will impact either the performance or the complexity of the application, and probably both.

    I would strongly recommend recoding the application so that it uses very little memory and runs as fast as possible. The error-recovery stuff will add 10x to the complexity of the code.

    Post some "lean and mean" Perl code with some detail of the app. The first focus should be to make it so fast that it doesn't have to be restarted in the middle. If that cannot be achieved, then let's talk about the restart code and to how to implement it efficiently and reliably.

Re: Best way to manage memory when processing a large file?
by Kc12349 (Monk) on Sep 16, 2011 at 15:43 UTC

    As to the efficiency of your data structure in memory, I'll leave that to others. Since I forget from time to time and get stung by leaving out the defined below, here is a basic while loop for reading line by line.

    The defined call ensures your loop does not end prematurely should a line evaluate to a false value (for example, a final line consisting of just "0" with no trailing newline).

    use autodie;

    open(my $fh, '<', $file);

    while ( defined (my $line = <$fh>) ) {
        chomp($line);
        # do stuff
    }