swissknife has asked for the wisdom of the Perl Monks concerning the following question:

Hello Dear Monks

In my script I am running a grep against each element of an array. The array now contains around 8000 elements, and processing each element takes so long that the script times out on the remote system from which I execute it. I am wondering if there is a way to improve the performance. Below is the snippet of the code in question:

my $path = "/tmp/testpatch";
opendir DIR, $path or die $!;
my @tempfiles = readdir DIR; # this array has the list of all the files, previously processed and newly added (8000 files)
closedir DIR;
foreach my $strfile (@tempfiles) {
    if (!grep /$strfile/, @arraytocompare) # this array has the list of files which were processed in the past
    {
        push (@newarray, $strfile); # this array collects all the new files which I need to process now
    }
}

Is there a faster grep, or will I have to live with it?

Replies are listed 'Best First'.
Re: improve performance
by Corion (Patriarch) on Jun 08, 2015 at 09:45 UTC

    Whenever you want a fast lookup, try to use a hash. This will only work if you want exact matches on whole elements rather than substring matches. Also note that if $strfile contains regex metacharacters (like *, ., + or []), your code will not work as you might expect.

    Using a hash would result in:

    my %lookup = map { $_ => 1 } @arraytocompare;
    foreach my $strfile (@tempfiles) {
        if ( $lookup{ $strfile } ) {
            push (@newarray, $strfile);
        }
    }

    But I really, really doubt that searching through 8000 array entries will slow your program down that much. Are you certain that this is where your performance bottleneck is?

      Thanks Corion. I added a few prints and executed the script using the -d option, which clearly shows that this is where the performance bottleneck is.

      You said that if $strfile contains regex metacharacters my code will not work as I might have thought. The file names have a "." before the extension. Does that caveat still apply if I use a hash?

        You may want to take a look at Devel::NYTProf for profiling your code.

      I updated the code with your suggestion, which is really faster, but it does not give the same result as the grep: @newarray comes out empty when it should not be. Did you consider the NOT operator (!) in my original code?

        No, I did not consider it, but you can consider it in your code.
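        For completeness, a minimal sketch (using hypothetical file names) of the hash lookup with the negation from the original code restored:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the arrays in the thread
my @arraytocompare = qw(old1.txt old2.txt);                    # previously processed
my @tempfiles      = qw(old1.txt old2.txt new1.txt new2.txt);  # current directory listing

# Build the lookup hash once, then keep only files NOT already processed
my %lookup   = map { $_ => 1 } @arraytocompare;
my @newarray = grep { !$lookup{$_} } @tempfiles;

print "@newarray\n";    # new1.txt new2.txt
```

        The hash is built once, and each membership test is constant time, so the whole filter is linear in the number of files instead of quadratic.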

Re: improve performance
by pme (Monsignor) on Jun 08, 2015 at 09:53 UTC
    You can transform @arraytocompare into a hash as you can see below. Hash key lookup is very efficient.
    my %hashtocompare;
    $hashtocompare{$_}++ for (@arraytocompare);
    foreach my $strfile (@tempfiles) {
        push @newarray, $strfile unless exists $hashtocompare{$strfile};
    }
Re: improve performance
by GotToBTru (Prior) on Jun 08, 2015 at 16:34 UTC

    Consider if you can use the none function from List::Util for operations like (! grep ). grep will check the entire list to find all the values that match, but in this case, once you have found a match you might as well stop. Some of the List::Util functions, like first or any or none, will short circuit.

    In this example, the variable $j shows how many times the loop body is executed.

    use strict;
    use warnings;
    use List::Util qw(none);

    my @list = 1..100;
    my $j = 0;

    if (! grep { $j++; $_ > 10 } @list) {
        print 'Didn\'t find any values above 10 ... ';
    } else {
        print 'Found some values above 10 ... ';
    }
    print "but I had to look at $j values to be sure.\n";

    $j = 0;
    if (none { $j++; $_ > 10 } @list) {
        print 'Didn\'t find any values above 10 ... ';
    } else {
        print 'Found some values above 10 ... ';
    }
    print "but I had to look at $j values to be sure.\n";

    Output:

    Found some values above 10 ... but I had to look at 100 values to be sure.
    Found some values above 10 ... but I had to look at 11 values to be sure.

    Update: forgot to copy the code that actually produces that output!

    Dum Spiro Spero
Re: improve performance
by Anonymous Monk on Jun 08, 2015 at 09:46 UTC

    I suspect that matching $strfile as a regex is not the best way to go about this, since it sounds like you are looking for exact matches, e.g. grep {$_ eq $strfile} @arraytocompare

    If you know the filenames are unique (usually a fairly safe bet), you can use a hash instead of @arraytocompare, or convert it to one with something like my %tocompare = map {$_=>1} @arraytocompare;, and then test against that via if (!$tocompare{$strfile}) ....
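    To see why the eq comparison matters, here is a small sketch (with made-up file names) of how an interpolated regex can produce false positives, both via the "." metacharacter and via unanchored substring matches:

```perl
use strict;
use warnings;

my @arraytocompare = ('dataXtxt', 'mydata.txt');   # made-up names
my $strfile        = 'data.txt';

# Regex match: '.' matches any character, and the pattern is unanchored,
# so it "finds" both elements even though neither equals $strfile
my $regex_hits = grep { /$strfile/ } @arraytocompare;        # 2 false positives

# Exact string comparison finds nothing, which is correct here
my $exact_hits = grep { $_ eq $strfile } @arraytocompare;    # 0

print "regex: $regex_hits, exact: $exact_hits\n";   # regex: 2, exact: 0
```

    Quoting the pattern with \Q...\E (quotemeta) fixes the metacharacter problem but not the substring problem; eq, or a hash lookup, avoids both.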

Re: improve performance
by marioroy (Prior) on Jun 12, 2015 at 13:23 UTC

    Update: Added simulation.

    Update: Changed to ! exists ...

    Looping and grep'ing for each temp file is likely expensive. This populates @newarray with new files only.

    # this hash has the list (keys) of all the files previously processed
    my %processed = map { $_ => 1 } @arraytocompare;

    my $path = "/tmp/testpatch";
    opendir DIR, $path or die $!;

    # this array gets all the new files which I need to process now
    my @newarray = map { ! exists $processed{$_} ? $_ : () } readdir DIR;

    closedir DIR;

    The above is simulated below and outputs 8000 taking just a fraction of a second to complete.

    # this hash has the list (keys) of all the files previously processed
    my %processed = map { $_ => 1 } 100001 .. 108000;

    # this array gets all the new files which I need to process now
    my @newarray = map { ! exists $processed{$_} ? $_ : () } 108001 .. 116000;

    print scalar @newarray, "\n";
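    The same filter can also be written with grep, which some find more readable than map with a ternary; a stylistic sketch using the same numbers as the simulation above:

```perl
use strict;
use warnings;

# hash of previously processed "files" (same simulated data as above)
my %processed = map { $_ => 1 } 100001 .. 108000;

# grep keeps only the elements for which the block is true
my @newarray = grep { !exists $processed{$_} } 108001 .. 116000;

print scalar @newarray, "\n";    # 8000
```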