greatshots has asked for the wisdom of the Perl Monks concerning the following question:

monks

opendir ( SSPR , "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/") or die "$!";
while ( defined ( $file_name = readdir(SSPR) ) ) {
    next if ( -d $file_name ); # removing . and ..
    open ( FH , "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/$file_name" ) or die "$!";
    $sspr_hash{$file_name} = [];
    @{$sspr_hash{$file_name}} = <FH>;
    map { $_ =~ s/[\n\r]//g } @{$sspr_hash{$file_name}};
}
I am using the above code to read all the files from a directory and store them in a hash, where the key is the filename and the values are that file's lines. There are more than 200 files, and each file contains more than 2000 lines.

The hash using the above method is used as follows
# this loop analyses the schema from the summary spr files
foreach $file_name ( keys %sspr_hash ) {
    foreach $line ( @{$sspr_hash{$file_name}} ) {
        if ( grep ( /$old_schema/i, $line ) ) {
            print "$old_schema|$new_schema_str|$found_status|$migratestr|$rename_str|$file_name\n";
        }
    }
}
In the above code each file is taken in an outer loop and its lines are scanned in an inner loop. The performance of the code is pretty slow. How can I improve its efficiency and speed it up? Please respond with any tips based on your experience with this kind of code.

Replies are listed 'Best First'.
Re: Improving the efficiency of code when processed against large amount of data
by madbombX (Hermit) on Nov 09, 2006 at 02:49 UTC
    It has been my experience that the best way to improve the performance of code is to profile it. Find out where most of the time is spent. If the majority of the time is spent in file I/O, then you can't do much because the speed is limited by your seek/access times. However, by profiling, you can find out which lines can potentially be sped up. You could also consider forking off and running processes in parallel (using Parallel::ForkManager). Back to profiling though...check out Devel::SmallProf or Devel::FastProf for line and subroutine profiling. Here is also a little HOWTO that I found useful for getting started (if it is your first attempt at code profiling): http://www.ddj.com/184404580.
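
    For the forking suggestion, a minimal Parallel::ForkManager sketch might look like the following. The directory path is taken from the original post; the limit of 4 children and the empty worker body are assumptions, not tested against the OP's data:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Assumed file list; substitute however you actually gather the summaryspr files.
    my @files = glob "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/*";

    my $pm = Parallel::ForkManager->new(4);   # run at most 4 children at once

    for my $file (@files) {
        $pm->start and next;                  # parent: move on to the next file
        # child: scan one file
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            # ... match and print here ...
        }
        close $fh;
        $pm->finish;                          # child exits
    }
    $pm->wait_all_children;

    Before forking, though, a profiler run such as perl -d:SmallProf yourscript.pl (Devel::SmallProf writes its report to smallprof.out) will tell you whether the time is really going into I/O or into the matching loop.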
Re: Improving the efficiency of code when processed against large amount of data
by duff (Parson) on Nov 09, 2006 at 05:34 UTC

    To add to what the other monks have said ...

    There's an error and at least one weirdness in your code:

    1. the next if ( -d $file_name ) line only works by accident. It checks whether $file_name names a directory in the current working directory, not in your summary spr directory. If there happens to be a directory in the current directory with the same name as a file in your summary spr directory, that file will be skipped.
    2. your map-in-void-context looks like it might just be trying to act as chomp. If so, it's just chomp(@{$sspr_hash{$file_name}} = <FH>); with no map afterwards.
    3. grep works on lists but you're only using it on a single item. Valid but strange. Typically people would just do $line =~ /$old_schema/i
    Also, I'm not sure why you've divided your task into two parts like this. Is there a reason you're reading the contents of the files before you do any processing? If not, I'd just do it all in one go like this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $ssprdir = "/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr/";

    opendir(my $sspr, $ssprdir) or die "Can't read $ssprdir - $!\n";
    my @files = grep { ! -d $_ } map { "$ssprdir/$_" } readdir $sspr;
    closedir $sspr;

    for my $file (@files) {
        my $fh;
        unless (open($fh, "<", $file)) {
            warn "Unable to open $file - $!\n";
            next;
        }
        # ...
        while (<$fh>) {
            next unless /$old_schema/i;
            # ...
            print "$old_schema|$new_schema_str|$found_status|$migratestr|$rename_str|$file\n";
        }
        close $fh;
    }
    __END__

    Also, if $old_schema is really supposed to be a string and not a regular expression, you might consider using index instead of engaging the RE engine.
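
    As a rough sketch of the index idea (assuming $old_schema really is a plain string; the lc calls keep the comparison case-insensitive like the /i match, and the print line is copied from the original post):

    my $needle = lc $old_schema;
    while (my $line = <$fh>) {
        next if index(lc $line, $needle) < 0;   # index returns -1 when the string isn't found
        print "$old_schema|$new_schema_str|$found_status|$migratestr|$rename_str|$file\n";
    }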

Re: Improving the efficiency of code when processed against large amount of data
by GrandFather (Saint) on Nov 09, 2006 at 02:49 UTC

    It's not clear why you need to chuck that stuff into a hash, but that should not have much effect on execution time compared to the file I/O.

    However there are a few foibles in your code that you should consider addressing (a combined sketch follows the list):

    • you should generally use the three parameter open: open FH, '<', "..."
    • map { $_ =~ s/[\n\r]//g } @{$sspr_hash{$file_name}}; is probably clearer as $_ =~ s/[\n\r]//g for @{$sspr_hash{$file_name}};
    • $_ =~ is redundant in $_ =~ s/[\n\r]//g
    • you should always include use strict; use warnings;
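
    Putting those points together, the read loop might be sketched like this. The path and hash name are taken from the original post; whether the CRLF handling is needed depends on the input files, which is an assumption here:

    use strict;
    use warnings;

    my $dir = '/apps/inst1/metrica/TechnologyPacks/ON-SITE/summaryspr';
    my %sspr_hash;

    opendir my $dh, $dir or die "Can't open $dir: $!";
    for my $file_name (readdir $dh) {
        next if -d "$dir/$file_name";                    # also skips . and ..
        open my $fh, '<', "$dir/$file_name" or die "Can't open $dir/$file_name: $!";
        chomp( @{ $sspr_hash{$file_name} } = <$fh> );    # slurp and strip newlines in one step
        s/\r$// for @{ $sspr_hash{$file_name} };         # only needed if the files have CRLF line endings
        close $fh;
    }
    closedir $dh;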

    DWIM is Perl's answer to Gödel
Re: Improving the efficiency of code when processed against large amount of data
by zer (Deacon) on Nov 09, 2006 at 05:52 UTC
    Another thing that I found sped my code up for larger files was to avoid slurping and go line by line
    foreach (<FH>){ print $_; }
    This should work.

      Only while loops avoid slurping though! for loops slurp.

      Slurping all the files before processing means they are all in ram. If you are light on ram that means you're swapping the data in, then out, then back into memory again.

      Line by line (or at least file by file) is usually the way to go for large datasets.
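
      For example (a sketch; the lexical filehandle and the pattern are assumed from the original post), this holds only one line in memory at a time:

      while (my $line = <$fh>) {                 # reads one line per iteration
          print $line if $line =~ /$old_schema/i;
      }

      whereas foreach my $line (<$fh>) { ... } pulls the entire file into a list before the first iteration runs.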

        good explanation thanks!