halecommarachel has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,
I am trying to speed up the parsing of thousands of files, each of them large. For each file I use open, a while loop over the lines, and a long chain of if/elsif statements with regular expressions that store the relevant information into a hash. Is there a different way to go through the files that's less expensive? Thanks!
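The per-file loop is essentially this shape (simplified; the patterns and hash keys below are placeholders, not my real ones):

    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        if    ($line =~ /^Date:\s+(\S+)/)   { $info{date} = $1 }
        elsif ($line =~ /^Error:\s+(.+)/)   { push @{ $info{errors} }, $1 }
        elsif ($line =~ /^Status:\s+(\w+)/) { $info{status} = $1 }
        # ... many more elsif branches ...
    }
    close $fh;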

Replies are listed 'Best First'.
Re: Parsing large files
by Loops (Curate) on Aug 06, 2013 at 02:20 UTC
    If you are I/O bound already, meaning you're processing files as fast as the drives can deliver them, there is little to be done without new hardware. However, if the bottleneck is in the calculations you're performing, adding some parallelism to your script will use any spare CPU capacity. Of course, there may be a problem in your current code that makes it unnecessarily slow; sharing it here might lead to improvements even in the single-threaded case.
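    For instance, something like Parallel::ForkManager lets several files be parsed at once (a rough sketch; process_file() is a stand-in for your existing per-file loop, and since the workers are separate processes, each one has to write out its own results rather than filling a shared hash):

        use strict;
        use warnings;
        use Parallel::ForkManager;

        my $pm = Parallel::ForkManager->new(4);   # one worker per spare core

        for my $file (@ARGV) {
            $pm->start and next;    # parent forks a child, then moves on
            process_file($file);    # child does the open/while/regex work
            $pm->finish;            # child exits when its file is done
        }
        $pm->wait_all_children;
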
Re: Parsing large files
by nevdka (Pilgrim) on Aug 06, 2013 at 01:42 UTC

    Unless the file is indexed, I don't think there's a way to avoid reading the whole thing line by line. But if you're using regexes inside the while loop, you might be able to speed up your parsing by pre-compiling them with qr//.
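
    Something like this, for example (made-up patterns, just to show the shape):

        my %info;
        my $date_re  = qr/^Date:\s+(\S+)/;    # compile each pattern once, up front
        my $error_re = qr/^Error:\s+(.+)/;

        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            if    ($line =~ $date_re)  { $info{date} = $1 }          # reuse the compiled regex
            elsif ($line =~ $error_re) { push @{ $info{errors} }, $1 }
        }
        close $fh;

    The biggest win is when the patterns are built from variables; qr// keeps them from being recompiled on every match.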

Re: Parsing large files
by Anonymous Monk on Aug 06, 2013 at 01:14 UTC

    Is there a different way to go through the files that's less expensive? Thanks!

    No. There is only one way to read files, and that is to actually read them. If this is too slow, buy a faster hard disk.

Re: Parsing large files
by sundialsvc4 (Abbot) on Aug 06, 2013 at 14:27 UTC

    The predecessor-to-Perl tool for doing this sort of thing was awk, and its programs consist entirely of a set of regular expressions, each followed by the action to take when that pattern is encountered in the text file.

    One thing to be very mindful of, though, is how much RAM you have and how much RAM the program consumes over time. In particular, all of the memory needed to process one file should be completely released back to Perl before processing of the next file begins. Although Perl does not release memory back to the operating system, you still should not see the working-set size (WSS) of the process continue to increase endlessly as additional files are processed. If it does, the process can start thrashing, and you will see it when the program (and the entire computer) grinds to a halt except for your disk drive, whose little light never goes out.
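
    One simple way to keep the working set flat is to declare the per-file data structures inside the loop over files, so each file's data goes out of scope before the next file is opened (a sketch; report() is a stand-in for whatever is done with the parsed data):

        for my $file (@files) {
            my %info;    # fresh, lexically scoped hash for this file only

            open my $fh, '<', $file or die "Can't open $file: $!";
            while (my $line = <$fh>) {
                # ... regex matches that fill %info ...
            }
            close $fh;

            report(\%info);   # hypothetical: use the results before %info goes away
        }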