Starting with your original code

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of original (37.764s total)

The first thing I did was to move everything to using lexicals, which gained about 6%.
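As a sketch of what "moving to lexicals" means (toy example, not the benchmarked script): a package variable is looked up through the symbol table on every access, while a my() variable resolves to a fixed pad slot at compile time.

```perl
#! perl
use strict;
use warnings;

our %slow;                      # package global: accessed via the stash
$slow{ $_ } = $_ for 1 .. 5;

my %fast;                       # lexical: accessed via a compile-time pad slot
$fast{ $_ } = $_ for 1 .. 5;

print scalar( keys %fast ), "\n";    # prints 5
```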

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of lexicals (35.801s total)

Next, I enabled strict and warnings. Surprisingly, this actually sped things up slightly. All the timings shown are the middle of three consecutive runs. Not scientific, but good enough, I think.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of lexicals strict warnings (35.571s total)

Next, I moved from using hashrefs to plain hashes. I mentioned this to you before, but here's a little demonstration that using references unnecessarily exacts a penalty, even if only 3% or so.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Hashes instead of hashrefs (34.850s total)
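A minimal sketch of that change (toy data, not the benchmark code): every access through a reference pays for an extra dereference op before the hash lookup proper.

```perl
use strict;
use warnings;

my $ref = {};                         # hashref: two ops per access
$ref->{ $_ } = $_ * 2 for 1 .. 10;    # deref, then hash lookup

my %plain;                            # plain hash: one op per access
$plain{ $_ } = $_ * 2 for 1 .. 10;    # hash lookup only

print "$ref->{ 7 } $plain{ 7 }\n";    # prints "14 14"
```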

Then, as mentioned above, using hashes with contiguous integer keys where an array would do the job better also takes its toll. In this case, a whole 40%.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Arrays instead of Hashes (21.070s total)
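A sketch of the hash-to-array change (illustrative values): with contiguous integer keys, the hash must stringify and hash every key, where the array simply indexes into its internal storage.

```perl
use strict;
use warnings;

my %by_key;                              # keys "1".."10" hashed as strings
$by_key{ $_ } = $_ ** 2 for 1 .. 10;

my @by_index;                            # direct indexed store, no hashing
$by_index[ $_ ] = $_ ** 2 for 1 .. 10;

print "$by_key{ 6 } $by_index[ 6 ]\n";   # prints "36 36"
```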

Finally, moving loops into C where possible by using Perl's ever-so-handy built-ins, in this case grep in scalar context, gets another 33% speed-up.

P:\test>273557.pl8
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Using scalar grep to count occurances (14.281s total)
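The core of that change, on toy 1-based data (values invented for illustration): grep in scalar context returns the number of matches, so the hand-rolled counting loop collapses into a single pass at C speed.

```perl
use strict;
use warnings;

my @article = ( undef, 3, 7, 7, 1, 9, 2, 7, 5, 0, 4 );   # 1-based; slot 0 unused
my @entry   = ( undef, 3, 6, 7, 1, 8, 2, 0, 5, 1, 4 );

# Scalar context: grep returns a count, not a list.
my $count = grep { $entry[ $_ ] eq $article[ $_ ] } 1 .. 10;

print "$count\n";                  # prints 6
print "match\n" if $count >= 3;    # same >= 3 test as the main loop
```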

Here's my version with all the changes above. I tried to keep everything as near as possible to your original, including having the arrays start at 1. Moving to zero-based arrays (or possibly using the deprecated $[) would probably gain another 1 or 2%.

#! perl -sw

use strict;
use Benchmark::Timer;
use vars qw[$SEED];

use constant ARTICLES => 100;
use constant ENTRIES  => 2000;

my $t = Benchmark::Timer->new();
srand( $SEED || 1 );

$t->start('Using scalar grep to count occurances');

my( @A, @E, %X );

#Articles
foreach my $article ( 1 .. ARTICLES ) {
    foreach my $feature ( 1 .. 10 ) {
        $A[$article][$feature] = int rand(100);
    }
}

#Entries
foreach my $entry ( 1 .. ENTRIES ) {
    foreach my $feature ( 1 .. 10 ) {
        $E[$entry][$feature] = int rand(100);
    }
}

$| += 1;

foreach my $article ( 1 .. ARTICLES ) {
    foreach my $entry ( 1 .. ENTRIES ) {
        my $count = grep{ $E[$entry][$_] eq $A[$article][$_] } 1 .. 10;
        push @{ $X{$article} }, $entry if $count >= 3;
    }
}

foreach my $article ( sort { $a <=> $b } keys %X ) {
    print "$article:";
    foreach my $entry ( @{ $X{$article} } ) {
        print "\t$entry";
    }
    print "\n";
}

$t->stop('Using scalar grep to count occurances');
$t->report;

I then got to thinking about trying to improve things by creating a lookup hash to remove the need to search.

This is complicated by the need to compare both keys (indices) and values, and further by the "any three from ten" factor. After several aborted attempts, I came up with the implementation below, which builds a hash whose keys combine each combination of 3 indices with the corresponding 3 values, stringified, for each of the Articles. Each value is an array of the articles from which that key was generated.
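Here is how a single key looks, on invented data: the three chosen indices and the three values found at those indices, joined into one string.

```perl
use strict;
use warnings;

my @features = ( undef, 42, 7, 99, 13, 8, 65, 21, 3, 70, 55 );  # 1-based
my $comb     = [ 2, 5, 9 ];        # one "3 from 10" combination of indices

# Indices and the values sliced out by those indices, stringified together.
my $key = "@$comb:@features[ @$comb ]";

print "$key\n";                    # prints "2 5 9:7 8 70"
```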

The detection loop then constructs a key from each combination for each of the Entries and looks that up in the hash. This reduces the number of nested loops, but the generation of the combinations and the stringification mean that the benefits are not as great as you might hope for.

Originally, I was generating the combinations for each set of keys/values, but then realised that as it is always 3 from 10, I could get away with generating a single set of indices and then use it over and over in array slices to extract the values. Once that had dawned on me, I achieved another 25% speed-up over the previous best.
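The reuse is easiest to see on a toy "3 from 4" case (combinations hand-written here rather than generated): one fixed list of index sets, applied as a slice to any row of values.

```perl
use strict;
use warnings;

# One set of index combinations, built once...
my @combs = ( [ 1, 2, 3 ], [ 1, 2, 4 ], [ 1, 3, 4 ], [ 2, 3, 4 ] );

# ...then reused as slice indices against every row of values.
my @row = ( undef, 'a', 'b', 'c', 'd' );    # 1-based

print "@$_:@row[ @$_ ]\n" for @combs;       # first line is "1 2 3:a b c"
```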

This accumulates to an overall saving of around 70% from your original on sets of 100 x 2000, which isn't too bad and is probably enough to warrant the extra complexity.

#! perl -slw

use strict;
use Benchmark::Timer;
use vars qw[$SEED];

use constant ARTICLES => 100;
use constant ENTRIES  => 2000;

$| += 1;

sub uniq{ my %h; @h{ @_ } = (); return keys %h; }

# Return a list of all the combinations of n (1st param) values
# from r (the rest of the argument list).
sub Cnr{
    my( $n, @r ) = shift;
    return [] unless $n--;
    for my $x ( 0 .. ($#_ - $n) ) {
        push @r, map{ [ $_[$x], @$_ ] } Cnr( $n, @_[ ($x + 1) .. $#_ ] );
    }
    return @r;
}

my $t = Benchmark::Timer->new();
srand( $SEED || 1 );

$t->start('Building and using a lookup list');

my( @A, @E, %X );
my( %A );

my @combs = Cnr( 3, 1 .. 10 );

#Articles
for my $article ( 1 .. ARTICLES ) {
    for my $feature ( 1 .. 10 ) {
        $A[$article][$feature] = int rand(100);
    }
    push @{ $A{ "@$_:@{ $A[$article] }[ @$_ ]" } }, $article for @combs;
}
# print "$_ : @{ $A[$_] }[ 1 .. 10 ]" for 1 .. ARTICLES;

#Entries
for my $entry ( 1 .. ENTRIES ) {
    for my $feature ( 1 .. 10 ) {
        $E[$entry][$feature] = int rand(100);
    }
}
# print "$_ : @{ $E[$_] }[ 1 .. 10 ]" for 1 .. ENTRIES;

for my $entry ( 1 .. ENTRIES ) {
    for my $comb ( map{ "@$_:@{ $E[ $entry ] }[ @$_ ]" } @combs ) {
        next unless exists $A{ $comb };
        push @{ $X{ $_ } }, $entry for @{ $A{ $comb } };
    }
}

for my $article ( sort { $a <=> $b } keys %X ) {
    printf STDERR "$article:";
    for my $entry ( @{ $X{$article} } ) {
        printf STDERR "\t$entry";
    }
    print STDERR '';
}

$t->stop('Building and using a lookup list');
$t->report;

__OUTPUT__
P:\test>273557
14:  462
15:  684
19:  898
21:  15  1892
22:  1997
41:  223
43:  762  762  762  762 (*)
45:  668
46:  1759
54:  1962
56:  509
61:  1497
68:  1006
69:  342  349
70:  565
75:  845
84:  1318
92:  382
99:  315

1 trial of Building and using a lookup list (10.705s total)

However, the lookup solution really starts to shine as the data sets grow and the number of comparisons saved, relative to lookups performed, increases. Here is a Benchmark::cmpthese-style comparison of the three versions when run on 1000 articles x 2000 entries.

            time(s)  original  array/grep  lookup
original    414.536      ----        -64%    -93%
array/grep  149.685      177%        ----    -81%
lookup       27.750     1394%        439%    ----

A reduction of 93% is well worth the complexity. And by the time this grows to the 10,000 x 200,000 you mentioned at the end of your post, I project (but haven't tested) reductions of the order of 99.5%, or 200 times faster.

(*) The astute (interested? bothered?) amongst you may have noted a small difference between the output from the lookup version and the other two, in that Entry 762 is listed as a match for Article 43 four times. This had me tearing my hair out for a while, until I inspected the details and discovered that Entry 762 can match Article 43 in four different ways.

Artist's original code would detect these 'duplicate' matches but only record, and therefore display, one occurrence. If he wished to retain that behaviour, the simplest (and cheapest) modification is to exclude the dups at output by replacing the line

for my $entry ( @{ $X{$article} } ) {

With

for my $entry ( uniq @{ $X{$article} } ) {

or similar.

I included my uniq() sub for this purpose. As an aside, a version of uniq written in C or XS would make a nice addition to List::Util, I think?
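For completeness, here is that uniq() at work on the duplicated matches. Note that keys %h comes back in no particular order, hence the sort here:

```perl
use strict;
use warnings;

# Same uniq() as in the lookup version: a hash slice drops duplicates.
sub uniq { my %h; @h{ @_ } = (); return keys %h }

my @matches = ( 762, 762, 762, 762, 15 );
my @unique  = sort { $a <=> $b } uniq @matches;

print "@unique\n";    # prints "15 762"
```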


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: Articles matching entries (93% less time at least) by BrowserUk
in thread Articles matching entries by artist
