greatshots has asked for the wisdom of the Perl Monks concerning the following question:

monks

opendir ( SSPR , "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/") or die "$!";
while ( defined ( $file_name = readdir(SSPR) ) ) {
    next if ( -d $file_name ); # removing . and ..
    open ( FH , "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/$file_name" ) or die "$!";
    $sspr_hash{$file_name} = [];
    @{$sspr_hash{$file_name}} = <FH>;
    map { $_ =~ s/[\n\r]//g } @{$sspr_hash{$file_name}};
}
I am using the above code to read all the files from a directory and store them in a hash, where the key is the filename and the values are that file's lines. There are more than 200 files, and each file contains more than 2000 lines.

The hash using the above method is used as follows
# this loop analyses the schema from the summary spr files
foreach $file_name ( keys %sspr_hash ) {
    foreach $line ( @{$sspr_hash{$file_name}} ) {
        if ( grep ( /$old_schema/i, $line ) ) {
            print "$old_schema|$new_schema_str|$found_status|$migratestr|$rename_str|$file_name\n";
        }
    }
}
In the above code each file is taken in an outer loop and its lines are scanned in an inner loop. The performance of the code is pretty slow. How can I improve its efficiency and speed it up? Please respond with any tips based on your experience with this kind of code.

Replies are listed 'Best First'.
Re: Improving the efficiency of code when processed against large amount of data
by madbombX (Hermit) on Nov 09, 2006 at 02:49 UTC
    It has been my experience that the best way to improve the performance of code is to profile it. Find out where most of the time is spent. If the majority of the time is spent in file I/O, then you can't do much because the speed is limited by your seek/access times. However, by profiling, you can find out which lines can potentially be sped up. You could also consider forking off and running processes in parallel (using Parallel::ForkManager). Back to profiling though...check out Devel::SmallProf or Devel::FastProf for line and subroutine profiling. Here is also a little HOWTO that I found useful for getting started (if it is your first attempt at code profiling): http://www.ddj.com/184404580.
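
    For the forking suggestion, a minimal Parallel::ForkManager sketch might look like the following. The directory path is taken from the original post; the limit of 4 children and the empty worker body are assumptions, not tested against the OP's data:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Assumed file list; substitute however you actually gather the summaryspr files.
    my @files = glob "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/*";

    my $pm = Parallel::ForkManager->new(4);   # run at most 4 children at once

    for my $file (@files) {
        $pm->start and next;                  # parent: move on to the next file
        # child: scan one file
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            # ... match and print here ...
        }
        close $fh;
        $pm->finish;                          # child exits
    }
    $pm->wait_all_children;

    Before forking, though, a profiler run such as perl -d:SmallProf yourscript.pl (Devel::SmallProf writes its report to smallprof.out) will tell you whether the time is really going into I/O or into the matching loop.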
Re: Improving the efficiency of code when processed against large amount of data
by duff (Parson) on Nov 09, 2006 at 05:34 UTC

    To add to what the other monks have said ...

    There's an error and at least one weirdness in your code:

    1. the next if ( -d $file_name ) line only works by accident. It checks whether $file_name names a directory in the current working directory, not in your summary spr directory. If there happens to be a directory in the current directory with the same name as a file in your summary spr directory, that file will be skipped.
    2. your map-in-void-context looks like it might just be trying to act as chomp. If so, it's just chomp(@{$sspr_hash{$file_name}} = <FH>); with no map afterwards.
    3. grep works on lists but you're only using it on a single item. Valid but strange. Typically people would just do $line =~ /$old_schema/i
    Also, I'm not sure why you've divided your task into two parts like this. Is there a reason you're reading the contents of the files before you do any processing? If not, I'd just do it all in one go like this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $ssprdir = "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/";

    opendir(my $sspr, $ssprdir) or die "Can't read $ssprdir - $!\n";
    my @files = grep { ! -d $_ } map { "$ssprdir/$_" } readdir $sspr;
    closedir $sspr;

    for my $file (@files) {
        my $fh;
        unless (open($fh, "<", $file)) {
            warn "Unable to open $file - $!\n";
            next;
        }
        # ...
        while (<$fh>) {
            next unless /$old_schema/i;
            # ...
            print "$old_schema|$new_schema_str|$found_status|$migratestr|$rename_str|$file\n";
        }
        close $fh;
    }
    __END__

    Also, if $old_schema is really supposed to be a string and not a regular expression, you might consider using index instead of engaging the RE engine.
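
    As a rough sketch of the index idea (assuming $old_schema really is a plain string; the lc calls keep the comparison case-insensitive like the /i match, and the print line is copied from the original post):

    my $needle = lc $old_schema;
    while (my $line = <$fh>) {
        next if index(lc $line, $needle) < 0;   # index returns -1 when the string isn't found
        print "$old_schema|$new_schema_str|$found_status|$migratestr|$rename_str|$file\n";
    }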

Re: Improving the efficiency of code when processed against large amount of data
by GrandFather (Saint) on Nov 09, 2006 at 02:49 UTC

    It's not clear why you need to chuck that stuff into a hash, but that should not have much effect on execution time compared to the file I/O.

    However there are a few foibles in your code that you should consider addressing (a combined sketch follows the list):

    • you should generally use the three parameter open: open FH, '<', "..."
    • map { $_ =~ s/[\n\r]//g } @{$sspr_hash{$file_name}}; is probably clearer as $_ =~ s/[\n\r]//g for @{$sspr_hash{$file_name}};
    • $_ =~ is redundant in $_ =~ s/[\n\r]//g
    • you should always include use strict; use warnings;
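
    Putting those points together, the read loop might be sketched like this. The path and hash name are taken from the original post; whether the CRLF handling is needed depends on the input files, which is an assumption here:

    use strict;
    use warnings;

    my $dir = '/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr';
    my %sspr_hash;

    opendir my $dh, $dir or die "Can't open $dir: $!";
    for my $file_name (readdir $dh) {
        next if -d "$dir/$file_name";                    # also skips . and ..
        open my $fh, '<', "$dir/$file_name" or die "Can't open $dir/$file_name: $!";
        chomp( @{ $sspr_hash{$file_name} } = <$fh> );    # slurp and strip newlines in one step
        s/\r$// for @{ $sspr_hash{$file_name} };         # only needed if the files have CRLF line endings
        close $fh;
    }
    closedir $dh;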

    DWIM is Perl's answer to Gödel
Re: Improving the efficiency of code when processed against large amount of data
by zer (Deacon) on Nov 09, 2006 at 05:52 UTC
    Another thing that I found sped my code up for larger files was to avoid slurping and go line by line
    foreach (<FH>){ print $_; }
    This should work.

      Only while loops avoid slurping though! for loops slurp.

      Slurping all the files before processing means they are all in ram. If you are light on ram that means you're swapping the data in, then out, then back into memory again.

      Line by line (or at least file by file) is usually the way to go for large datasets.
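
      For example (a sketch; the lexical filehandle and the pattern are assumed from the original post), this holds only one line in memory at a time:

      while (my $line = <$fh>) {                 # reads one line per iteration
          print $line if $line =~ /$old_schema/i;
      }

      whereas foreach my $line (<$fh>) { ... } pulls the entire file into a list before the first iteration runs.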

        good explanation thanks!