While working on a complex script doing lookups and searches on a dozen arrays of hashes (each array representing a relational database table) I stumbled across an extremely simple improvement that instantly gave almost twice the performance.

The original loop looked like this:

sub filter {
    my $where = shift;
    my @in    = @_;
    # This class method is used to filter an array of hashrefs against
    # a set of criteria defined in $where.
    # Example:
    #   @matching_hosts = filter( { site => 56, type => 4 }, @all_hosts );
    # In this example, @matching_hosts will only contain those hashrefs
    # that would return TRUE for the following code:
    #   ($_->{'site'} eq '56' && $_->{'type'} eq '4')
    # Note that the "eq" and "&&" are implied; no other operators are
    # supported.
    # The order of the array is not affected.
    my @out = ();
    foreach my $record (@in) {
        my $keep = 1;
        foreach my $field (keys %{$where}) {
            unless ($record->{$field} eq $where->{$field}) {
                $keep = 0;
                last;
            }
        }
        push @out, $record if $keep;
    }
    return @out;
}

The rewritten loop looks like this:

sub filter {
    my $where = shift;
    my @in    = @_;
    # This class method is used to filter an array of hashrefs against
    # a set of criteria defined in $where.
    # Example:
    #   @matching_hosts = filter( { site => 56, type => 4 }, @all_hosts );
    # In this example, @matching_hosts will only contain those hashrefs
    # that would return TRUE for the following code:
    #   ($_->{'site'} eq '56' && $_->{'type'} eq '4')
    # Note that the "eq" and "&&" are implied; no other operators are
    # supported.
    # The order of the array is not affected.
    my @out = ();
    # Make one pass per match term
    foreach my $field (keys %{$where}) {
        my $value = $where->{$field};
        @out = grep { $_->{$field} eq $value } @in;
        @in  = @out;    # Prepare for next pass (if any)
    }
    return @out;
}

The running times of actual reports dropped from over 4 seconds to less than 2 seconds. Some of that improvement obviously came from using the built-in grep{} function instead of manually checking each value and push()ing hashrefs onto the @out array, but I didn't expect that much of an improvement.

There had to be a different explanation, and that got me thinking about the cost of setting up and executing a foreach() loop:

$ cat foreach_inner
#!/usr/bin/perl
use strict;
use warnings;

foreach my $foo (1 .. 3) {
    foreach my $bar (1 .. 10000000) {
        my $pointless = "$foo.$bar";
    }
}
$ time ./foreach_inner

real    0m8.975s
user    0m8.954s
sys     0m0.013s
$ cat foreach_outer
#!/usr/bin/perl
use strict;
use warnings;

foreach my $foo (1 .. 10000000) {
    foreach my $bar (1 .. 3) {
        my $pointless = "$foo.$bar";
    }
}
$ time ./foreach_outer

real    0m14.106s
user    0m14.092s
sys     0m0.003s

Both test scripts do exactly the same amount of (pointless) work; the difference between them is that 'foreach_outer' has to set up its inner foreach() loop 9,999,997 more times than 'foreach_inner' does.
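The setup cost can also be measured side by side with the core Benchmark module. This is a minimal sketch with the iteration counts scaled down from the scripts above (so the timings will differ), not a reproduction of the numbers quoted here:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Both subs build the same 3_000_000 strings; they differ only in
# how many times the inner foreach() loop has to be set up.
cmpthese( 10, {
    inner_big => sub {                     # inner loop set up 3 times
        foreach my $foo (1 .. 3) {
            foreach my $bar (1 .. 1_000_000) {
                my $pointless = "$foo.$bar";
            }
        }
    },
    outer_big => sub {                     # inner loop set up 1_000_000 times
        foreach my $foo (1 .. 1_000_000) {
            foreach my $bar (1 .. 3) {
                my $pointless = "$foo.$bar";
            }
        }
    },
} );
```

cmpthese() prints a rate table comparing the two, so the overhead shows up directly as a percentage difference rather than wall-clock seconds.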

Sometimes, even a seemingly pointless improvement can make a significant difference if made in the right place.

Now, the way filters are specified in $where is pretty much nailed down because that hashref is built and used in a lot of different contexts. I am still looking for a way to express the whole thing as a single grep{} block to eliminate the looping altogether. Maybe tomorrow.
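One candidate for that single grep{} block, untested against the real data, is a nested grep that rejects a record on the first mismatching field; the sub name and sample data below are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical single-pass variant of filter(): one grep over the
# records, with an inner grep acting as an "any field mismatches?" test.
sub filter_single_pass {
    my $where  = shift;
    my @fields = keys %{$where};
    return grep {
        my $record = $_;    # inner grep shadows $_, so take a copy
        # Keep the record only if no field in $where mismatches.
        !grep { $record->{$_} ne $where->{$_} } @fields;
    } @_;
}

my @hosts = (
    { site => 56, type => 4, name => 'a' },
    { site => 56, type => 9, name => 'b' },
    { site => 56, type => 4, name => 'c' },
);
my @matching = filter_single_pass( { site => 56, type => 4 }, @hosts );
# @matching now holds the hashrefs named 'a' and 'c', in order.
```

Note this still loops over the fields per record (in the inner grep), so whether it beats the one-pass-per-field version likely depends on how quickly a typical record is rejected.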

-- FloydATC

Time flies when you don't know what you're doing

Re: On optimizing nested loops
by RichardK (Parson) on Oct 19, 2014 at 12:42 UTC

    That's interesting. Does using grep show any significant improvement over just swapping the order of the foreach loops in your first filter?

    I wonder if using each will have any impact? I'll have to give it a go and see.

    So maybe something like this?

    while (my ($field, $value) = each %{$where}) {
        @in = grep { $_->{$field} eq $value } @in;
    }
      I could not measure any difference between using foreach/keys and while/each.

      I'm thinking maybe there's a way to turn my $where hashref into a single expression usable by grep() without making it work significantly slower.

      -- FloydATC

      Time flies when you don't know what you're doing

        If @in is fairly long, then I'd certainly expect this:

        @in = grep { $_->{$field} eq $value } @in;

        To be measurably faster than:

        @out = grep { $_->{$field} eq $value } @in;
        @in  = @out;

        Another optimization you could try is to note that each field you process cuts down the size of @in, so it makes sense to first process the fields that will cut it down the most. Filtering on gender eq "f" is probably going to cut out around 50% of @in, but filtering on surname eq "Davis" is likely to cut out around 99%.

        Right now, you're iterating through keys %$where in no particular order - i.e. hash order. Sorting keys %$where to prioritize keys where the data exhibits a high degree of variation ought to speed many searches up significantly, and the only cost will be sorting what is presumably a fairly small list of hash keys.
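A rough sketch of that ordering idea, assuming the number of distinct values a field has in the data is a usable proxy for its selectivity (sub and variable names hypothetical):

```perl
use strict;
use warnings;

# Order the filter fields so the most selective one runs first.
# A field with many distinct values (e.g. surname) tends to cut
# more records per pass than one with few (e.g. gender).
sub filter_ordered {
    my ( $where, @in ) = @_;

    # Count distinct values per filter field across the data.
    my %distinct;
    for my $field ( keys %{$where} ) {
        my %seen;
        $seen{ $_->{$field} }++ for @in;
        $distinct{$field} = scalar keys %seen;
    }

    # Most distinct values first => likely the biggest cut first.
    for my $field ( sort { $distinct{$b} <=> $distinct{$a} } keys %{$where} ) {
        my $value = $where->{$field};
        @in = grep { $_->{$field} eq $value } @in;
    }
    return @in;
}
```

The counting pass itself touches every record once per field, so on small arrays it may cost more than it saves; sampling the data, or ordering by cached statistics, would be the obvious refinements.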

Re: On optimizing nested loops
by Anonymous Monk on Oct 19, 2014 at 10:08 UTC

    Why copy when you have aliases?

      Good point.

      -- FloydATC

      Time flies when you don't know what you're doing

Re: On optimizing nested loops
by Laurent_R (Canon) on Oct 25, 2014 at 15:20 UTC
    Your first code sample might run faster if you rewrote it this way:
    OUTER: foreach my $record (@in) {
        foreach my $field (keys %{$where}) {
            next OUTER unless $record->{$field} eq $where->{$field};
        }
        push @out, $record;
    }
    But I cannot be sure and cannot test, not having the data.

    Update: I just tried to compare the two approaches with a far simpler data set (so it might not be really significant to your data).