in reply to Perl Code Runs Extremely Slow

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %file1_data;
    my %file2_data;

    print "\n\n";

    # Process data files
    ProcessFile( "File1", \%file1_data );
    ProcessFile( "File2", \%file2_data );

    # Format output
    my @keys = sort keys %file1_data;
    foreach my $key ( @keys ) {
        my $target = exists $file2_data{$key} ? $file2_data{$key} : '';
        print "$key $file1_data{$key} $target\n";
    }

    # Process a file, write data into supplied hash ref.
    sub ProcessFile {
        my $filename = shift;
        my $data     = shift;    # Pass data by reference - big hashes used here.

        open( DATAFILE, '<', $filename )
            or die "Unable to open $filename - $!\n";

        # Store data in hash.
        # Only the last instance of any key is stored.
        while ( my $line = <DATAFILE> ) {
            my ($key, $value) = split /\s+/, $line, 2;    # split into max of 2 fields.
            $data->{$key} = $value;
        }

        close( DATAFILE )
            or die "Unable to close $filename - $!\n";
    }

As others have said, we only open and read each file once. That is a huge savings.

I put the identical file-processing code into a subroutine. I passed a hash reference to the subroutine to speed up data exchange--returning a hash would force a copy of each hash to be made as the data is passed back from the function. If you prefer to avoid the non-obvious munging of an argument, the sub could look like:

    my $file1_data = ProcessFile("File1");
    my $file2_data = ProcessFile("File2");

    # ...

    sub ProcessFile {
        my $filename = shift;
        my %data;

        # ...

        return \%data;
    }

Both are about equally efficient, and which you use is a stylistic issue.

Keep using strict and the 3-argument version of open. Use the warnings pragma rather than the '-w' switch.
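
A minimal sketch of that pattern for reference (the lexical filehandle is my own addition, not something from the code above):

    use strict;
    use warnings;                 # preferred over the -w switch

    # Three-argument open with a lexical filehandle instead of a bareword.
    open( my $fh, '<', 'File1' )
        or die "Unable to open File1 - $!\n";

    while ( my $line = <$fh> ) {
        # ... process $line ...
    }

    close( $fh ) or die "Unable to close File1 - $!\n";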

Sorts are expensive; you want to do them as few times as possible.
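
For example (a sketch with made-up %data and @records, not code from the original), hoist the sort out of the loop so it runs once instead of once per record:

    use strict;
    use warnings;

    my %data    = ( b => 2, a => 1, c => 3 );    # made-up data
    my @records = ( 'x', 'y', 'z' );             # made-up records

    # Wasteful: re-sorts the same keys on every pass through the outer loop.
    for my $record (@records) {
        for my $key ( sort keys %data ) {
            # ... compare $record against $key ...
        }
    }

    # Better: sort once, then reuse the sorted list.
    my @sorted_keys = sort keys %data;
    for my $record (@records) {
        for my $key (@sorted_keys) {
            # ... compare $record against $key ...
        }
    }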

You may want to look into how to measure the computational complexity of an algorithm and 'big O notation'. In practice, I have found that the need to do this type of analysis is limited, but learning it will give you an intuitive sense of what types of things are costly--which will improve your algorithms.
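
As a rough illustration of the idea (made-up data, not the OP's files): a hash lookup is O(1) while scanning a list is O(n), which is why building a hash once and probing it beats rescanning a file for every key.

    use strict;
    use warnings;

    my @file2_lines = ( "k1 v1", "k2 v2", "k3 v3" );    # made-up records

    # O(n) per lookup: every search walks the whole list again.
    sub find_by_scan {
        my ($key) = @_;
        for my $line (@file2_lines) {
            my ( $k, $v ) = split /\s+/, $line, 2;
            return $v if $k eq $key;
        }
        return '';
    }

    # O(1) per lookup: build the hash once, then each probe is constant time.
    my %file2_data = map { split /\s+/, $_, 2 } @file2_lines;

    my $by_scan = find_by_scan('k2');     # walks the list
    my $by_hash = $file2_data{'k2'};      # single hash probe
    print "scan: $by_scan  hash: $by_hash\n";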

Keep up the good work!


TGI says moo

Re^2: Perl Code Runs Extremely Slow
by samtregar (Abbot) on Jun 15, 2006 at 16:21 UTC
    This is the worst one yet! How much memory do you think he has? He'd have to be on a 64-bit platform with 8GB of RAM for this script to have a chance of running to completion. Check out Devel::Size and think about what's going to happen when you put nearly 5GB of data into hashes!
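
    A quick way to see this for yourself (a sketch with made-up data, not the OP's files):

        use strict;
        use warnings;
        use Devel::Size qw(total_size);

        # A hash of one million short records -- the in-memory footprint is
        # far larger than the raw text that produced it.
        my %data = map { $_ => "value $_" } 1 .. 1_000_000;
        print total_size( \%data ), " bytes\n";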

    -sam

      Well, doesn't everyone have an S390?

      Assuming that all his keys are unique, you are right, the OP will need to use some sort of disk-based storage--he probably doesn't have enough RAM. Perhaps an SQL database (SQLite, anyone?) or a dbm file would do the trick.
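
      Something along these lines, for instance (a sketch using DBD::SQLite; the database, table, and key names are made up):

          use strict;
          use warnings;
          use DBI;

          # Store the key/value pairs on disk instead of in a giant hash.
          my $dbh = DBI->connect( 'dbi:SQLite:dbname=merge.db', '', '',
              { RaiseError => 1, AutoCommit => 0 } );
          $dbh->do('CREATE TABLE IF NOT EXISTS file2 (k TEXT PRIMARY KEY, v TEXT)');

          my $insert = $dbh->prepare('INSERT OR REPLACE INTO file2 (k, v) VALUES (?, ?)');
          open( my $fh, '<', 'File2' ) or die "Unable to open File2 - $!\n";
          while ( my $line = <$fh> ) {
              my ( $key, $value ) = split /\s+/, $line, 2;
              $insert->execute( $key, $value );
          }
          close($fh);
          $dbh->commit;

          # Later: look up a single key without holding File2 in RAM.
          my ($value) = $dbh->selectrow_array(
              'SELECT v FROM file2 WHERE k = ?', undef, 'some_key' );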

      Any reason why tying each hash to a DB_File would be a bad idea? The largest datasets I've had to deal with have only been a few tens of megabytes in size. So, is DB_File up to the task?
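
      For reference, the tie I have in mind would look roughly like this (a sketch; the file name is made up):

          use strict;
          use warnings;
          use Fcntl;        # for O_RDWR and O_CREAT
          use DB_File;

          # Tie the hash to a Berkeley DB file so entries live on disk, not in RAM.
          tie my %file2_data, 'DB_File', 'file2.db', O_RDWR | O_CREAT, 0666, $DB_HASH
              or die "Cannot tie file2.db - $!\n";

          $file2_data{'some_key'} = 'some value';    # written through to the file
          untie %file2_data;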

      Would you care to offer a suggestion?

      Edit: Upon reviewing the OP's code, it looks like he is keeping both hashes in RAM at the same time (at least the worst-case memory consumption for my code and the OP's is the same). If his code runs to completion as posted, so will mine. I also noted your suggestion to try DB_File, and so struck my request for suggestions.


      TGI says moo

        Review again - you are incorrect. The OP loads all of file 2 into memory (each time he reads a line from file 1!) but never loads all of file 1. He makes it look like he does by putting lines from file 1 into a hash, but that hash is local to the while block and thus never contains more than one line. Since file 2 is much smaller than file 1, I think it's quite reasonable to assume that while he might fit file 2 in memory+swap, that's unlikely to work with file 1.
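
        The pitfall being described looks roughly like this (an illustration of the scoping mistake, not the OP's exact code):

            use strict;
            use warnings;

            open( my $fh, '<', 'File1' ) or die "Unable to open File1 - $!\n";
            while ( my $line = <$fh> ) {
                # Declared inside the loop, so %file1_data is created fresh on
                # every iteration and only ever holds the current line's pair.
                my %file1_data;
                my ( $key, $value ) = split /\s+/, $line, 2;
                $file1_data{$key} = $value;
            }
            close($fh);
            # %file1_data is out of scope here -- nothing was accumulated.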

        DB_File might work well enough, but it's hard to know. A lot depends on his definition of "fast enough" and the actual composition of his data.

        -sam