artist has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have an articles table and an entries table that contain records. Each record has features and corresponding values. Each record can have a different number of features drawn from a fixed set, and can carry different values; the set of values, however, is not fixed. I am trying to find all the entries that have at least 3 matching features and values for a given article.

Currently my code takes a long time to run.

# We'll pass it as ref to SCALAR and receive it in a ref to ARRAY

use constant ARTICLES => 100;
use constant ENTRIES  => 2000;

#Articles
foreach $article ( 1 .. ARTICLES ) {
    foreach $feature ( 1 .. 10 ) {
        $A->{$article}{$feature} = int rand(100);
    }
}

#Entries
foreach $entry ( 1 .. ENTRIES ) {
    foreach $feature ( 1 .. 10 ) {
        $E->{$entry}{$feature} = int rand(100);
    }
}

$| += 1;

foreach my $article ( 1 .. ARTICLES ) {
    foreach $entry ( 1 .. ENTRIES ) {
        # print "." if $entry % 1000 == 0;
        my $count;
        foreach $feature ( 1 .. 10 ) {
            if ( $E->{$entry}{$feature} eq $A->{$article}{$feature} ) {
                $count += 1;
            }
        }
        if ( $count >= 3 ) {    #Found match
            push @{ $X->{$article} }, $entry;
        }
    }
}

foreach my $article ( sort { $a <=> $b } keys %{$X} ) {
    print "$article:";
    foreach my $entry ( @{ $X->{$article} } ) {
        print "\t$entry";
    }
    print "\n";
}
Sample Output:
1:	1120
7:	1563
12:	341
24:	4
25:	1977
28:	455
30:	143
36:	618
43:	802
44:	107
48:	263
50:	1824
53:	1734
60:	1778
64:	350	448
71:	1673
72:	691
75:	780
76:	265	1669
80:	1260
85:	1429	1966
98:	758
99:	60
Any better and faster method would be welcome. Currently it works fine, but for 10,000 articles and 200,000 entries it takes a lot of time.
Thank you,
artist.

Replies are listed 'Best First'.
Re: Articles matching entries
by kvale (Monsignor) on Jul 11, 2003 at 22:36 UTC
    If all the indices are numbers, just use arrays for more compact data structures and faster access.

    Algorithmically, one way of speeding up the process is to build an index. In your program, each (entry, feature) value is accessed many times. How about building an index of values, with each value pointing to the (entry, feature) pairs that carry it? Then for each (article, feature) value, you access the matching entries directly, instead of scanning all of them in a loop every time. The savings from all those eliminated loops will far outweigh the time taken to build the index.
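
    The inverted-index idea above might be sketched like this (my reading of the suggestion, not kvale's actual code; the sample data is made up). Entries are keyed by a "feature:value" string, so each article only ever touches the entries that share at least one pair with it:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # article => { feature => value } and entry => { feature => value }
    my %A = ( 1 => { 1 => 5, 2 => 7, 3 => 9 } );
    my %E = (
        10 => { 1 => 5, 2 => 7, 3 => 9, 4 => 2 },
        11 => { 1 => 5, 2 => 0, 3 => 1, 4 => 2 },
    );

    # Build the index once: "feature:value" => [ entries with that pair ]
    my %idx;
    for my $entry ( keys %E ) {
        while ( my ( $feature, $value ) = each %{ $E{$entry} } ) {
            push @{ $idx{"$feature:$value"} }, $entry;
        }
    }

    # For each article, tally shared pairs per entry via the index,
    # then keep the entries with 3 or more shared pairs.
    my %match;
    for my $article ( keys %A ) {
        my %count;
        while ( my ( $feature, $value ) = each %{ $A{$article} } ) {
            $count{$_}++ for @{ $idx{"$feature:$value"} || [] };
        }
        $match{$article} = [ grep { $count{$_} >= 3 } keys %count ];
    }

    print "article 1 matches: @{ $match{1} }\n";   # entry 10 shares 3 pairs
    ```

    The payoff is that entries sharing no pair with an article are never visited at all, instead of being compared feature by feature.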

    -Mark

Re: Articles matching entries (93% less time at least)
by BrowserUk (Patriarch) on Jul 13, 2003 at 01:47 UTC

    Starting with your original code

    1 trial of original (37.764s total)

    The first thing I did was to move everything to lexicals, which gained 6%.

    1 trial of lexicals (35.801s total)

    Next, I enabled strict and warnings. Surprisingly, this actually sped things up slightly. All the timings shown are the middle of three consecutive runs. Not scientific, but good enough, I think.

    1 trial of lexicals strict warnings (35.571s total)

    Next, I moved from using hashrefs to simple hashes. I mentioned this to you before, but here's a small demonstration that using references unnecessarily exacts a penalty, even if only 3% or so.

    1 trial of Hashes instead of hashrefs (34.850s total)

    Then, as mentioned above, using hashes with contiguous integer keys, when an array would do the job better, also takes its toll. In this case a whole 40%.
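
    The hash-to-array conversion might look like this (a minimal sketch, not the code from the post): since article, entry, and feature indices are all contiguous integers, an array-of-arrays replaces the hash-of-hashes directly.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # $A[$article][$feature] instead of $A->{$article}{$feature};
    # index 0 is left unused to keep the original 1-based numbering.
    my @A;
    for my $article ( 1 .. 100 ) {
        for my $feature ( 1 .. 10 ) {
            $A[$article][$feature] = int rand 100;
        }
    }
    print scalar @{ $A[1] }, "\n";   # 11 slots: index 0 unused, 1..10 filled
    ```

    Array element access avoids the hashing step entirely, which is where the 40% comes from.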

    1 trial of Arrays instead of Hashes (21.070s total)

    Finally, moving loops into C where possible by using Perl's ever-so-handy built-ins, in this case grep in scalar context, gets another 33% speed-up.
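
    The grep step described above might look like this (a sketch with made-up data, not the posted code): the innermost feature loop collapses into a single grep in scalar context, which iterates in C rather than Perl opcodes.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # 1-based arrays, as in the post: index 0 is a placeholder.
    my @article = ( undef, 4, 8, 15, 16, 23, 42, 4, 8, 15, 16 );
    my @entry   = ( undef, 4, 8, 99, 16, 0, 0, 0, 0, 0, 0 );

    # grep in scalar context returns the number of matches,
    # replacing the explicit counting loop over features 1..10.
    my $count = grep { $entry[$_] == $article[$_] } 1 .. 10;
    print "$count\n";   # 3 shared (feature, value) pairs
    ```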

    1 trial of Using scalar grep to count occurances (14.281s total)

    Here's my version with all the changes above. I tried to keep everything as close as possible to your original, including having the arrays start at 1. Moving to zero-based arrays (or possibly using the deprecated $[) would probably gain another 1 or 2%.

    I then got to thinking about trying to improve things by creating a lookup hash to prevent the need to search.

    This is complicated by the need to compare both keys (indices) and values and further by the "any three from ten" factor. After several aborted attempts, I came up with the implementation below that creates a hash using keys that contain each combination of 3 indices and the corresponding 3 values, stringified, for each of the Articles. The value is an array of articles from which that key is generated.

    The detection loop then constructs a key from each combination for each of the Entries and looks it up in the hash. This reduces the number of nested loops, but the generation of the combinations and the stringification mean that the benefits are not as great as you might hope for.

    Originally, I was generating the combinations for each set of keys/values, but then realised that as it is always 3 from ten, I could get away with generating a single set of indices, and then use this over and over in array slices to extract the values. Once that had dawned on me, I achieved another 25% speed up over the previous best.
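
    The combination-key lookup described above might be sketched as follows (my reconstruction of the idea, not BrowserUk's code; data is made up). The C(10,3) = 120 index triples are generated once and then reused as array slices on both articles and entries:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # All 120 index triples (3 from 10), generated once up front.
    my @triples;
    for my $i ( 1 .. 8 ) {
        for my $j ( $i + 1 .. 9 ) {
            for my $k ( $j + 1 .. 10 ) {
                push @triples, [ $i, $j, $k ];
            }
        }
    }

    my @A = ( undef, [ undef, 1 .. 10 ] );              # one article, values 1..10
    my @E = ( undef, [ undef, 1, 2, 3, (0) x 7 ] );     # entry 1 shares features 1..3

    # Build the lookup: "i,j,k:vi,vj,vk" => [ articles generating that key ]
    my %lookup;
    for my $article ( 1 .. $#A ) {
        for my $t (@triples) {
            my $key = join( ',', @$t ) . ':' . join( ',', @{ $A[$article] }[@$t] );
            push @{ $lookup{$key} }, $article;
        }
    }

    # Detection: build each entry's keys the same way and probe the hash.
    my %match;
    for my $entry ( 1 .. $#E ) {
        for my $t (@triples) {
            my $key = join( ',', @$t ) . ':' . join( ',', @{ $E[$entry] }[@$t] );
            $match{$_}{$entry} = 1 for @{ $lookup{$key} || [] };
        }
    }
    print join( ' ', sort keys %{ $match{1} } ), "\n";   # entry 1 matches article 1
    ```

    Note the slice `@{ $A[$article] }[@$t]`, which pulls the three values for a triple in one step; that is the "single set of indices reused over and over" that produced the extra 25%.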

    This accumulates to an overall saving of around 70% over your original on sets of 100 x 2000, which isn't too bad and is probably enough to warrant the extra complexity.