dramguy has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a huge file (4.2GB) which contains entries similar to this:

Length_meas_C1C2 0.22
0.00 00.000 .090
Length_meas_C1C2 0.18
0.00 00.000 .090
0.00 00.000 .090
Length_meas_C1C2 0.18
Length_meas_C1C2 0.18
0.00 00.000 .090
Length_meas_C1C2 0.18

I am trying to parse this file and count the number of
times a certain layer number is found. Here is the
code I have...this works fine for smaller files, however
I am getting the out of memory error for the actual
data files. Can anyone offer any suggestions on how to
avoid this issue? Thanks in advance!

#!/opt/perl/5.8.7-32bit/bin/perl

my $inFile = $ARGV[0];
my %CALHASH;
my @CALARR;

if(!$ARGV[0]) {exit;}

open(FH, "<$inFile");
foreach $line (<FH>) {
    chomp($line);   #remove newline from end of line
    if($line =~ /(\w+_meas_.*)/) {
        ##($layer, $enc) = split(' ', $1);
        $layer = $1;
        #print "$layer, $enc\n";
        if (exists $CALHASH{$layer}) {
            $CALHASH{$layer}{'freq'}++;
        } else {
            $CALHASH{$layer}{'freq'} = 1;
        }
    }
}

foreach $key (keys %CALHASH) {
    print "$CALHASH{$key}{'freq'} $key\n";
}

Replies are listed 'Best First'.
Re: Out of memory!!??
by naikonta (Curate) on May 25, 2007 at 15:12 UTC
    foreach $line (<FH>) {
    Basically, this code tries to read all file content at once. That's what <> does when evaluated in list context. Try to read the file line by line. Saying,
    while (<FH>) {
    is the same as
    while (defined($_ = <FH>)) {
    In the code above, <FH> is evaluated in scalar context and the <> operator returns one line at a time until it reaches end of file. See perlop for more detail.
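Applied to the OP's loop, a minimal self-contained sketch (reading from an in-memory handle here to stand in for the real 4.2GB file):

```perl
use strict;
use warnings;

# Sample data standing in for the huge input file
my $data = "Length_meas_C1C2 0.22\n0.00 00.000 .090\n"
         . "Length_meas_C1C2 0.18\nLength_meas_C1C2 0.18\n";
open my $fh, '<', \$data or die $!;

my %freq;
while (my $line = <$fh>) {   # scalar context: one line per iteration
    chomp $line;
    $freq{$1}++ if $line =~ /(\w+_meas_.*)/;
}
print "$freq{$_} $_\n" for sort keys %freq;
close $fh;
```

Only the current line and the counter hash are ever in memory, so file size no longer matters.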

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

      Thanks!! This did the trick.
      I'd still like to play around with the DBI package to see if speed is improved.
      Thanks again.
Re: Out of memory!!??
by salva (Canon) on May 25, 2007 at 15:25 UTC
    You are reading the full file in memory before processing it. Use while to loop instead of foreach.

    Also, you are using a hash of hashes to store the counters when a simple hash will do and reduce the memory consumption by an order of magnitude:

    @ARGV == 1 or die "Usage: ...";

    my $inFile = $ARGV[0];
    my %freq;

    open(FH, "<$inFile");
    while (<FH>) {
        chomp;
        if(/(\w+_meas_.*)/) {
            ##($layer, $enc) = split(' ', $1);
            $layer = $1;
            #print "$layer, $enc\n";
            $freq{$layer}++;
        }
    }

    foreach $key (keys %freq) {
        print "$freq{$key} $key\n";
    }
Re: Out of memory!!??
by derby (Abbot) on May 25, 2007 at 15:01 UTC

    Add more memory to the machine or use something like DB_File.
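    A minimal sketch of the DB_File approach (the database file name is illustrative): the counter hash is tied to an on-disk Berkeley DB, so the counts live on disk and memory stays flat no matter how many distinct keys there are.

    ```perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie the counter hash to an on-disk Berkeley DB file (hypothetical name)
    my $dbfile = 'layer_counts.db';
    tie my %freq, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie $dbfile: $!";

    # Sample data standing in for the real input file
    my $data = "Length_meas_C1C2 0.22\nLength_meas_C1C2 0.18\n"
             . "Length_meas_C1C2 0.18\n";
    open my $fh, '<', \$data or die $!;
    while (my $line = <$fh>) {
        chomp $line;
        $freq{$1}++ if $line =~ /(\w+_meas_.*)/;
    }
    print "$freq{$_} $_\n" for keys %freq;
    untie %freq;
    ```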

    -derby
Re: Out of memory!!??
by jettero (Monsignor) on May 25, 2007 at 15:02 UTC

    You can combine Storable with DB_File to create a low-memory, disk-backed solution. You only need something like Storable because your hash is multi-level. If you can avoid that, then all you need is DB_File.
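    A minimal sketch of the Storable half of that idea (the hash and key names are illustrative): a DB_File-tied hash can only store flat strings, so the nested per-layer record is serialized before storing and deserialized on the way back.

    ```perl
    use strict;
    use warnings;
    use Storable qw(freeze thaw);

    # A nested record like the OP's $CALHASH{$layer} entry
    my %record = ( freq => 3 );

    # freeze() turns the structure into a flat byte string that a
    # DB_File-tied hash could store on disk; thaw() reverses it.
    my $frozen = freeze(\%record);    # e.g. $db{$layer} = $frozen;
    my $back   = thaw($frozen);
    print "freq = $back->{freq}\n";
    ```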

    There are packages that store deep structures automatically, but they don't come with perl, which is sometimes an issue on platforms where perl is installed under /opt/.

    Otherwise, have a gander at DBM::Deep.

    -Paul

Re: Out of memory!!??
by zentara (Cardinal) on May 25, 2007 at 15:07 UTC
    It seems to me your script should work, since you are reading it line-by-line. My guess is your Perl or OS doesn't have "large-file-support". See Large File Support

    Do a "perl -V" and see if you can find the phrase "USE_LARGE_FILES".


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      It seems to me your script should work, since you are reading it line-by-line. My guess is your Perl or OS doesn't have "large-file-support".

      But is it possible that no one thus far has noticed the

      foreach $line (<FH>) {

      line in the OP's code?!? Well, maybe the problem will still be there, but there's a reason we keep recommending against reading a whole file in list context like that. To the OP: just try using a while loop instead. Until you have a fully functional Perl 6 installation available, that is!

      Update: naikonta noticed.

Re: Out of memory!!??
by cengineer (Pilgrim) on May 25, 2007 at 16:23 UTC
Re: Out of memory!!??
by moritz (Cardinal) on May 27, 2007 at 08:15 UTC
    This regex: /(\w+_meas_.*)/ might not be the best choice if the file is not very uniform and there are long lines that match the pattern. If you know that anything after the 'meas_' is at most 100 chars, you can use /(\w+_meas_.{0,100})/ instead.

    If the file contains some very long matching lines, that keeps them from being stored in the hash in full.

    This is not your main problem, but might be a precaution anyway.
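    A quick self-contained illustration of the difference on a pathological line:

    ```perl
    use strict;
    use warnings;

    # A very long line: the greedy .* would capture all ~10,000 trailing chars
    my $line = 'Length_meas_C1C2 0.18 ' . ('x' x 10_000);

    my ($unbounded) = $line =~ /(\w+_meas_.*)/;
    my ($bounded)   = $line =~ /(\w+_meas_.{0,100})/;
    print length($unbounded), " vs ", length($bounded), "\n";
    ```

    The bounded capture stays small regardless of how long the line is.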

    As a side note, the if (exists ... code is superfluous: you can just as well increment $CALHASH{$layer}{'freq'} whether the entry exists or not.
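    For instance, Perl's autovivification creates the entry on the first increment:

    ```perl
    use strict;
    use warnings;

    my %CALHASH;
    # No exists() check needed: incrementing a missing entry
    # autovivifies it, starting from undef (treated as 0).
    $CALHASH{'Length_meas_C1C2'}{'freq'}++ for 1 .. 3;
    print "$CALHASH{'Length_meas_C1C2'}{'freq'}\n";   # prints 3
    ```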