Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to generate statistics from a very large text file (1GB+) and need to process the file line by line. I originally had something like this:
open(FILE, "<$ARGV[0]") || die "$!";

count_bases(20);
count_bases(30);
...

sub count_bases {
    seek FILE, 0, SEEK_SET;    # reset file pointer
    my $min_quality = shift;
    foreach $line (<FILE>) {   # main loop
        my @tmp_ar = split("\t", $line);
        ...    # make comparisons and put #s in hash
    }
}

but this ended up being too slow. I run this on a remote server (where I do not have admin rights) and the script always gets killed after locking up the server for a couple of minutes. I looked over a similar script (that I did not write) and noticed that it used while(<>) to read things in. That script was not killed by the server when processing the same large file. So I made the following changes:

count_bases(20, $ARGV[1]);
count_bases(30, $ARGV[1]);
...

sub count_bases {
    my $min_quality = shift;
    while (<>) {
        my @tmp_ar = split();
        ...    # make comparisons and put #s in hash
    }
}

This, however, does not work for the 2nd count_bases call; it just freezes after the 1st. I am guessing that while(<>) takes from @ARGV and not from the values that I pass into the function, so after the 1st call uses @ARGV up, the 2nd never starts.

My questions: why is the implicit open so much faster? Is there a way to get the implicit open to work twice? Is there a way to make the 1st way fast without the implicit open?

thanks

Re: Efficiently processing a file
by jwkrahn (Abbot) on Jan 25, 2010 at 22:28 UTC
    need to process the file line by line. I originally had something like this:
    ... foreach $line (<FILE>) { # main loop

    foreach works on lists, so it reads the entire file into a list in memory and then iterates through that list. To read just one line at a time you need to use a while loop instead:

    while ( my $line = <FILE> ) { # main loop
    I am guessing that the while(<>) takes from @ARGV and not the values that I pass into the function

    Yes, this is correct. The <> operator is special in that it reads through all the files in @ARGV, and each file name is removed from @ARGV and put into $ARGV while that file is being processed.
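    If you do need to make multiple passes with <> (as in your second example), one workaround is to localize @ARGV inside the sub so each call starts with a fresh copy of the file list. A minimal sketch, assuming the file name arrives in $ARGV[0]:

    use strict;
    use warnings;

    my $file = $ARGV[0];

    count_bases(20);
    count_bases(30);

    sub count_bases {
        my $min_quality = shift;
        local @ARGV = ($file);    # give <> a fresh file list for this call
        while (my $line = <>) {
            my @fields = split /\t/, $line;
            # ... make comparisons and put counts in a hash ...
        }
    }

    Note that this still reads the whole 1GB+ file once per call; a single pass that checks every threshold per line (as suggested elsewhere in this thread) avoids the repeated I/O entirely.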

      I decided to make an account since I could not edit my previous posts w/out one.

      Thanks, I did not know that foreach reads everything into memory first; that was probably what was causing the server to kill the program. Would there be an appreciable speed increase if I called the function by reference? When would you usually call a function by reference?
      &count_bases(20);
      &count_bases(30);

        The foreach loop is hungry in that it stores all lines in memory. A while loop can avoid that by merely nibbling at the cake until it's gone. Without knowing the actual details you face, I assume the speed issue is one of reading the file rather than processing individual lines. That's the key bit others pointed out.

        While I do not know what your count_bases() function intends to do, splitting the reading and processing steps may help you. I assume you want to do regex matching or something else that can be done on individual lines.

        while (my $line = <>) {
            foreach my $count (20, 30) {
                count_bases($line, $count);
            }
        }

        The foreach loop makes it easy to extend the particular processing steps on individual lines (e.g. through a dispatch table with code references). The while loop simply keeps running as long as there is input, not trying to store it all in memory at once.

        Code references (a reference to 'code') are very useful for dispatch tables: they allow you to easily parametrize behaviour of your program. You can store them in arrays or hashes and later on loop over those arrays to ensure all (or only specific) actions are taken. Higher Order Perl by Mark Jason Dominus has a chapter that I find quite instructive.
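        For instance, a minimal sketch of such a dispatch table (the threshold actions and the two-column line layout here are made-up assumptions):

        use strict;
        use warnings;

        my %counts;

        # Dispatch table: names mapped to code references.
        my %actions = (
            q20 => sub { my $q = shift; $counts{q20}++ if $q >= 20 },
            q30 => sub { my $q = shift; $counts{q30}++ if $q >= 30 },
        );

        while (my $line = <>) {
            my ($base, $quality) = split /\t/, $line;    # assumed layout
            $actions{$_}->($quality) for sort keys %actions;
        }

        print "$_: ", $counts{$_} || 0, "\n" for sort keys %actions;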

        Putting an ampersand (&) in front of your subroutine names like that does not call the function by reference. In your example there is no difference with or without the ampersand. See perlsub for an explanation of what the ampersand does.
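        A small sketch of the distinction (the sub body is just a stand-in):

        use strict;
        use warnings;

        sub count_bases { my $min = shift; print "minimum quality: $min\n" }

        count_bases(20);            # ordinary call
        &count_bases(20);           # same effect here; & merely bypasses prototypes

        my $ref = \&count_bases;    # taking a reference *does* use &
        $ref->(20);                 # calling through the reference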

        Rather than saying "foreach reads everything into memory first," it's more accurate to say that the <FH> operator behaves differently in scalar and list contexts. Scalar context returns the next line of the file - list context returns all of them. When to call by reference, and when not, is a little complicated, but a good rule of thumb is that if you're not absolutely sure what you're doing, you definitely shouldn't call by reference. If you are absolutely sure, you maybe still shouldn't.
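        A short sketch of that context distinction, assuming a filehandle opened on $ARGV[0]:

        open my $fh, '<', $ARGV[0] or die "$!";

        my $one_line  = <$fh>;    # scalar context: just the next line
        my @all_lines = <$fh>;    # list context: every remaining line, all in memory

        # For a 1GB+ file, stay in scalar context inside a loop:
        # while (my $line = <$fh>) { ... }

        close $fh;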
Re: Efficiently processing a file
by ahmad (Hermit) on Jan 25, 2010 at 22:27 UTC

    Basically, what you want is something like this:

    open(F, $ARGV[0]) or die $!;
    while ( <F> ) {
        # remove \n
        chomp;
        # process the current line.
    }
    close(F);
Re: Efficiently processing a file
by Anonymous Monk on Jan 25, 2010 at 22:26 UTC
    I meant to write $ARGV[0] for the 1st code example. I was also wondering if I should be using foreach or while in this situation, and if it would be faster to call the subroutine by reference:
    count_bases(20, $ARGV[1]);
    count_bases(30, $ARGV[1]);

      Why are you using that sub in the first place? What you're trying to do is not clear (to me).

      You can test multiple conditions against the current line at once, then loop and get the next line.

      But instead I see that you're seeking back to the beginning of the file each time you want to check for something, which is not efficient.
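      A single-pass sketch along those lines (the tab-separated layout and which column holds the quality value are assumptions):

      use strict;
      use warnings;

      my %count;
      my @thresholds = (20, 30);

      while (my $line = <>) {
          chomp $line;
          my @fields  = split /\t/, $line;
          my $quality = $fields[1];    # assumed: quality in the 2nd column
          for my $min (@thresholds) {
              $count{$min}++ if $quality >= $min;
          }
      }

      print "min quality $_: ", $count{$_} || 0, "\n" for @thresholds;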