Starting with your original code

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of original (37.764s total)

The first thing I did was to move everything to using lexicals, which gained about 6%.
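As a sketch of what "moving to lexicals" means (toy example, not the benchmarked script): a package variable is looked up through the symbol table on every access, while a my() variable resolves to a fixed pad slot at compile time.

```perl
#! perl
use strict;
use warnings;

our %slow;                      # package global: accessed via the stash
$slow{ $_ } = $_ for 1 .. 5;

my %fast;                       # lexical: accessed via a compile-time pad slot
$fast{ $_ } = $_ for 1 .. 5;

print scalar( keys %fast ), "\n";    # prints 5
```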

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of lexicals (35.801s total)

Next, I enabled strict and warnings. Surprisingly, this actually sped things up slightly. All the timings shown are the middle of three consecutive runs. Not scientific, but good enough, I think.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of lexicals strict warnings (35.571s total)

Next, I moved from using hashrefs to plain hashes. I mentioned this to you before, but here's a little demonstration that using references unnecessarily exacts a penalty, even if only 3% or so.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Hashes instead of hashrefs (34.850s total)
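A minimal sketch of that change (toy data, not the benchmark code): every access through a reference pays for an extra dereference op before the hash lookup proper.

```perl
use strict;
use warnings;

my $ref = {};                         # hashref: two ops per access
$ref->{ $_ } = $_ * 2 for 1 .. 10;    # deref, then hash lookup

my %plain;                            # plain hash: one op per access
$plain{ $_ } = $_ * 2 for 1 .. 10;    # hash lookup only

print "$ref->{ 7 } $plain{ 7 }\n";    # prints "14 14"
```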

Then, as mentioned above, using hashes with contiguous integer keys where an array would do the job better also takes its toll. In this case, a whole 40%.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Arrays instead of Hashes (21.070s total)
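A sketch of the hash-to-array change (illustrative values): with contiguous integer keys, the hash must stringify and hash every key, where the array simply indexes into its internal storage.

```perl
use strict;
use warnings;

my %by_key;                              # keys "1".."10" hashed as strings
$by_key{ $_ } = $_ ** 2 for 1 .. 10;

my @by_index;                            # direct indexed store, no hashing
$by_index[ $_ ] = $_ ** 2 for 1 .. 10;

print "$by_key{ 6 } $by_index[ 6 ]\n";   # prints "36 36"
```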

Finally, moving loops into C where possible by using Perl's ever-so-handy built-ins, in this case grep in scalar context, gets another 33% speed-up.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Using scalar grep to count occurances (14.281s total)
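The core of that change, on toy 1-based data (values invented for illustration): grep in scalar context returns the number of matches, so the hand-rolled counting loop collapses into a single pass at C speed.

```perl
use strict;
use warnings;

my @article = ( undef, 3, 7, 7, 1, 9, 2, 7, 5, 0, 4 );   # 1-based; slot 0 unused
my @entry   = ( undef, 3, 6, 7, 1, 8, 2, 0, 5, 1, 4 );

# Scalar context: grep returns a count, not a list.
my $count = grep { $entry[ $_ ] eq $article[ $_ ] } 1 .. 10;

print "$count\n";                  # prints 6
print "match\n" if $count >= 3;    # same >= 3 test as the main loop
```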

Here's my version with all the changes above. I tried to keep everything as near as possible to your original, including having the arrays start at 1. Moving to zero-based arrays (or possibly using the deprecated $[) would probably gain another 1 or 2%.

#! perl -sw

use strict;
use Benchmark::Timer;
use vars qw[$SEED];

use constant ARTICLES => 100;
use constant ENTRIES  => 2000;

my $t = Benchmark::Timer->new();
srand( $SEED || 1 );

$t->start('Using scalar grep to count occurances');

my( @A, @E, %X );

#Articles
foreach my $article ( 1 .. ARTICLES ) {
    foreach my $feature ( 1 .. 10 ) {
        $A[$article][$feature] = int rand(100);
    }
}

#Entries
foreach my $entry ( 1 .. ENTRIES ) {
    foreach my $feature ( 1 .. 10 ) {
        $E[$entry][$feature] = int rand(100);
    }
}

$| += 1;

foreach my $article ( 1 .. ARTICLES ) {
    foreach my $entry ( 1 .. ENTRIES ) {
        my $count = grep{ $E[$entry][$_] eq $A[$article][$_] } 1 .. 10;
        push @{ $X{$article} }, $entry if $count >= 3;
    }
}

foreach my $article ( sort { $a <=> $b } keys %X ) {
    print "$article:";
    foreach my $entry ( @{ $X{$article} } ) {
        print "\t$entry";
    }
    print "\n";
}

$t->stop('Using scalar grep to count occurances');
$t->report;

I then got to thinking about trying to improve things by creating a lookup hash to remove the need to search.

This is complicated by the need to compare both keys (indices) and values, and further by the "any three from ten" factor. After several aborted attempts, I came up with the implementation below, which builds a hash whose keys combine each combination of 3 indices with the corresponding 3 values, stringified, for each of the Articles. Each value is an array of the articles from which that key was generated.
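Here is how a single key looks, on invented data: the three chosen indices and the three values found at those indices, joined into one string.

```perl
use strict;
use warnings;

my @features = ( undef, 42, 7, 99, 13, 8, 65, 21, 3, 70, 55 );  # 1-based
my $comb     = [ 2, 5, 9 ];        # one "3 from 10" combination of indices

# Indices and the values sliced out by those indices, stringified together.
my $key = "@$comb:@features[ @$comb ]";

print "$key\n";                    # prints "2 5 9:7 8 70"
```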

The detection loop then constructs a key from each combination for each of the Entries and looks that up in the hash. This reduces the number of nested loops, but the generation of the combinations and the stringification mean that the benefits are not as great as you might hope for.

Originally, I was generating the combinations for each set of keys/values, but then realised that as it is always 3 from 10, I could get away with generating a single set of indices and then use it over and over in array slices to extract the values. Once that had dawned on me, I achieved another 25% speed-up over the previous best.
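The reuse is easiest to see on a toy "3 from 4" case (combinations hand-written here rather than generated): one fixed list of index sets, applied as a slice to any row of values.

```perl
use strict;
use warnings;

# One set of index combinations, built once...
my @combs = ( [ 1, 2, 3 ], [ 1, 2, 4 ], [ 1, 3, 4 ], [ 2, 3, 4 ] );

# ...then reused as slice indices against every row of values.
my @row = ( undef, 'a', 'b', 'c', 'd' );    # 1-based

print "@$_:@row[ @$_ ]\n" for @combs;       # first line is "1 2 3:a b c"
```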

This accumulates to an overall saving of around 70% from your original on sets of 100 x 2000, which isn't too bad and is probably enough to warrant the extra complexity.

#! perl -slw

use strict;
use Benchmark::Timer;
use vars qw[$SEED];

use constant ARTICLES => 100;
use constant ENTRIES  => 2000;

$| += 1;

sub uniq{ my %h; @h{ @_ } = (); return keys %h; }

# Return a list of all the combinations of n (1st param) values
# from r (the rest of the argument list).
sub Cnr{
    my( $n, @r ) = shift;
    return [] unless $n--;
    for my $x ( 0 .. ($#_ - $n) ) {
        push @r, map{ [ $_[$x], @$_ ] } Cnr( $n, @_[ ($x + 1) .. $#_ ] );
    }
    return @r;
}

my $t = Benchmark::Timer->new();
srand( $SEED || 1 );

$t->start('Building and using a lookup list');

my( @A, @E, %X );
my( %A );

my @combs = Cnr( 3, 1 .. 10 );

#Articles
for my $article ( 1 .. ARTICLES ) {
    for my $feature ( 1 .. 10 ) {
        $A[$article][$feature] = int rand(100);
    }
    push @{ $A{ "@$_:@{ $A[$article] }[ @$_ ]" } }, $article for @combs;
}
# print "$_ : @{ $A[$_] }[ 1 .. 10 ]" for 1 .. ARTICLES;

#Entries
for my $entry ( 1 .. ENTRIES ) {
    for my $feature ( 1 .. 10 ) {
        $E[$entry][$feature] = int rand(100);
    }
}
# print "$_ : @{ $E[$_] }[ 1 .. 10 ]" for 1 .. ENTRIES;

for my $entry ( 1 .. ENTRIES ) {
    for my $comb ( map{ "@$_:@{ $E[ $entry ] }[ @$_ ]" } @combs ) {
        next unless exists $A{ $comb };
        push @{ $X{ $_ } }, $entry for @{ $A{ $comb } };
    }
}

for my $article ( sort { $a <=> $b } keys %X ) {
    printf STDERR "$article:";
    for my $entry ( @{ $X{$article} } ) {
        printf STDERR "\t$entry";
    }
    print STDERR '';
}

$t->stop('Building and using a lookup list');
$t->report;

__OUTPUT__
P:\test>273557
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762  762  762  762 (*)
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Building and using a lookup list (10.705s total)

However, the lookup solution really starts to shine as the data sets grow and the number of comparisons saved, relative to lookups performed, increases. Here is a Benchmark::cmpthese-style comparison of the three versions when run on 1000 articles x 2000 entries.

            time(s)  original  array/grep  lookup
original    414.536      ----        -64%    -93%
array/grep  149.685      177%        ----    -81%
lookup       27.750     1394%        439%    ----

A reduction of 93% is well worth the complexity. And by the time this grows to the 10,000 x 200,000 you mentioned at the end of your post, I project (but haven't tested) reductions of the order of 99.5%, or 200 times faster.

(*) The astute (interested? bothered?) amongst you may have noted a small difference between the output from the lookup version and the other two, in that Entry 762 is listed as a match for Article 43 four times. This had me tearing my hair out for a while, until I inspected the details and discovered that Entry 762 can match Article 43 in four different ways.

Artist's original code would detect these 'duplicate' matches but only record, and therefore display, one occurrence. If he wished to retain that behaviour, the simplest (and cheapest) modification is to exclude the dups at output by replacing the line

for my $entry ( @{ $X{$article} } ) {

With

for my $entry ( uniq @{ $X{$article} } ) {

or similar.

I included my uniq() sub for this purpose. As an aside, a version of uniq written in C or XS would make a nice addition to List::Util, I think?
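For completeness, here is that uniq() at work on the duplicated matches. Note that keys %h comes back in no particular order, hence the sort here:

```perl
use strict;
use warnings;

# Same uniq() as in the lookup version: a hash slice drops duplicates.
sub uniq { my %h; @h{ @_ } = (); return keys %h }

my @matches = ( 762, 762, 762, 762, 15 );
my @unique  = sort { $a <=> $b } uniq @matches;

print "@unique\n";    # prints "15 762"
```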


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: Articles matching entries (93% less time at least) by BrowserUk
in thread Articles matching entries by artist
