Re^2: Optimize a perl block

G'day Eily,

++ I tried a few things independently; however, it appears that much of that is very similar to what you've done, so I'll post it here for comparison.

"my @keys = keys %$hash_ref;" was one of my first thoughts and this appeared to be a definite winner. I also tried inlining the results (now discarded, but it was something like: "sub hrkeys () { keys %$hash_ref }"): that proved to be slower than using "@keys".

I had code very similar to your "sum map ..."; although, I used sum0. That appeared to be slower (even with the "map EXPR" form I used); I suspect any gains from sum were overshadowed by losses from map; I didn't investigate that any further.

I didn't think of caching. That's a good idea, and might put "sum(0) map" back in the picture; however, as you stated, that will depend on the OP's data (which hasn't been shown).

I dummied up some test data (based on the OP's description but, I'm sure, far from representative); ran some basic timings; and included some sanity checking. Here's the code I found to be fastest.

#!/usr/bin/env perl -l

use strict;
use warnings;

use Time::HiRes 'time';

my $hash_ref;
@$hash_ref{'a' .. 'j'} = 1 .. 10;
my $array_ref = [ ('v-w-x-y-z') x 2e6 ];
my $foo;
my $value = 0;
for my $outer ('v' .. 'z') {
    $foo->{$outer}{$_}{value} = ++$value for 'a' .. 'j';
}

my $t0 = time;
op_code();
my $t1 = time;
printf "op_code:   %.6f\n", $t1 - $t0;
kens_code();
my $t2 = time;
printf "kens_code: %.6f\n", $t2 - $t1;
print '*** Compare ***';
printf "kens/op:   %.6f%%\n", (($t2 - $t1) / ($t1 - $t0)) * 100;

sub op_code {
    my $bar;

    foreach my $a (@{ $array_ref }) {
        my $i = 0;
        foreach my $b (split('-', $a)) {
            foreach my $c (keys %{ $hash_ref }) {
                $i += $foo->{$b}->{$c}->{'value'};
            }
        }
        push @{ $bar->{$i} }, $a;
    }

    print '*** op_code ***';
    print "@{[ $_, $#{$bar->{$_}}, $bar->{$_}[0] ]}" for keys %$bar;
}

sub kens_code {
    my $bar;
    my @keys = keys %$hash_ref;

    for my $outer (@$array_ref) {
        my $sum;
        for my $inner (split /-/, $outer) {
            for (@keys) {
                $sum += $foo->{$inner}{$_}{value};
            }
        }
        push @{$bar->{$sum}}, $outer;
    }

    print '*** kens_code ***';
    print "@{[ $_, $#{$bar->{$_}}, $bar->{$_}[0] ]}" for keys %$bar;
}
[download]

The array, with two million elements, took about 30s (so it's roughly comparable to what the OP describes). My code was typically shaving around 25-30% off of this. I ran it quite a few times — here's a fairly representative run.

*** op_code ***
1275 1999999 v-w-x-y-z
op_code:   31.985272
*** kens_code ***
1275 1999999 v-w-x-y-z
kens_code: 23.075756
*** Compare ***
kens/op:   72.144941%
[download]

By the way, I totally agree with your comments re $a and $b: I only use those as special variables. I'm not completely averse to single-letter variable names, such as $i for a loop index; although, I do cringe when I find them liberally scattered through production code — meaningful names are a much better choice.

— Ken

Comment on Re^2: Optimize a perl block Select or Download Code

Replies are listed 'Best First'.
Re^3: Optimize a perl block by Eily (Monsignor) on Sep 07, 2017 at 08:35 UTC
Hi kcott, thanks for your answer :). I also tried inlining the results (now discarded, but it was something like: "sub hrkeys () { keys %$hash_ref }"): that proved to be slower than using "@keys". The () prototype makes a function a candidate for inlining, but only if the function doesn't have side effects and returns a constant value. keys can both have side effects (when working on a tied hash), and isn't guaranteed to return a constant value so your function couldn't be inlined. I didn't think of caching. That's a good idea, and might put "sum(0) map" back in the picture; however, as you stated, that will depend on the OP's data (which hasn't been shown). Well I used your code to try, and the runtime is divided by ten. But that's because all the keys in the outer hash are identical, so the the sum is only computed once, and the cache is used everytime after that. I doubt the OP's data is that redundant :). Here's my code BTW: `sub eilys_code { my $bar; my %cache; my @keys = keys %$hash_ref; for my $outer (@$array_ref) { my $sum; for my $inner (split /-/, $outer) { $sum = $cache{$inner} //= sum map { $_->{value} } @{ $foo->{ +$inner} }{@keys}; } push @{$bar->{$sum}}, $outer; } print '* kens_code *'; print "@{[ $_, $#{$bar->{$_}}, $bar->{$_}[0] ]}" for keys %$bar; }` [download] The fact that with this sample data all the sums are identical makes this test less representative though, because `push @{$bar->{$sum}}, $outer;` autovivifies an array only once. I'm not completely averse to single-letter variable names, such as $i for a loop index I think $x, $y and $z as coordinates are fine. I still try to avoid $i though, because most of the time you don't need a loop index in perl, so if one is needed, there's probably a meaningful name to explain why.	[reply] [d/l] [select]
Re^4: Optimize a perl block by kcott (Archbishop) on Sep 07, 2017 at 23:26 UTC
"... keys ... isn't guaranteed to return a constant value so your function couldn't be inlined." I disagree with that in part. I did take into consideration this from the keys doco: "Hash entries are returned in an apparently random order." Along with this, a couple of sentences later: "So long as a given hash is unmodified you may rely on keys, values and each to repeatedly return the same order as each other. " A very quick and dirty test (which I ran a number of times) with `$ perl -e 'my %x = qw{@ 1 \| 2}; for (1..30) { print keys %x }; print " +\n"'` [download] always returned one of these two: `\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@ @\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|@\|` [download] So, that's a reasonable confirmation of the doco, and the fact that my Perl version (`5.26.0`, by the way) is adhering to this. Anyway, I went back to the code I posted earlier, and added the subroutine requesting inlining back in; however, this time I wrote it like this, so that `@keys` was constant and couldn't possibly be altered elsewhere due to the lexical scoping of the anonymous block: `{ my @keys = keys %$hash_ref; sub hrkeys () { @keys } }` [download] There was no improvement on the OP's code. I ran it a few times, the "`kens/op:`" value was now showing 100% (±2%). With regard to the caching, that's the reason I didn't spend any time trying to check it. The minimal sanity checking (that the OP's and my code were producing equivalent results) showed `%$bar` had one key (`"1275"`) and its arrayref value held all of the two million elements. — Ken	[reply] [d/l] [select]
Re^5: Optimize a perl block by Eily (Monsignor) on Sep 08, 2017 at 08:47 UTC
I was thinking about the fact that the the output of keys can change when the input changes. As in `sub hrkeys () { keys %$hash_ref }`, where you just have to change the content of %$hash_ref to change the output. But my wording wasn't clear.	[reply] [d/l]