Calculate the similarity of two arrays of numbers

Ieronim has asked for the wisdom of the Perl Monks concerning the following question:

I have two arrays of rational numbers, e.g.

my @spectrum = qw/174.0 102.99 71.62 70.53 82.49 69.05 18.94 102.99 71
+.62 70.53 82.49 69.05 18.94 102.99 71.62 70.53 82.49 69.05 18.94 102.
+99 71.62 70.53 82.49 69.05 18.94/;
my @subsp = qw/102.0 18.94 71.61 82.0 173.0 68.7/;
[download]

For practical reasons, the bigger is called spectrum, and the smaller is called subspectrum. Every number in the arrays is called signal.

The difference between spectrum and subspectrum is calculated as sum of absolute values of differences between the signals in subspectrum and the closest signals in big spectrum. The 'similarity' is simply the number of signals in subspectrum divided by the difference. If the difference is 0 (impossible in most cases), the similarity is considered 1000.

I need to write a function which calculates the similarity between given spectrum and subspectrum and returns it with set of 'best' signals from big spectrum. For the given spectrum and subspectrum the answer is

REQUEST    = 18.94 68.7 71.61 82.0 102.0 173.0
BEST_SUBSP = 18.94 69.05 71.62 82.49 102.99 174.0
SIMILARITY = 2.11
[download]

The main problem about the function is that it needs to be really fast, as it is used very intensively.

The best way i have found as yet is implementing the alpha-beta pruning search algorithm. The parameter $limit accepted by the function appears from practice too - it limites maximum the difference between a pair of signals in spectrum and subspectrum. If the difference exceeds the limit, the similarity is considered to be 0.

#!/usr/bin/perl

use warnings;
use strict;

sub similarity {
    my ($request, $spectrum, $accuracy, $limit) = @_[0..3];
    my $fmt = "%.${accuracy}f";

    my @spectrum = @$spectrum;
    my @request  = sort {$a <=> $b} @$request;

    return 0, [] unless @request;
    return 0, [] if @spectrum < @request;
    
    my %good;
    foreach my $signal (@request) {
        my @good;
        return 0 unless (@good = grep { abs($_-$signal) <= $limit } @s
+pectrum);
        %good = (%good, map { ($_ => 1) } @good);
    }
    @spectrum = sort {$a <=> $b} grep { $good{$_} } @spectrum;
    
    my $alpha;
    my $best_subsp;
    my $alphabeta;
    $alphabeta = sub {
        my ($beta, $spectrum, $subsp, $found) = @_;
        unless (@$subsp) {
            $alpha = $beta;
            $best_subsp = $found;
            return;
        }

        my $compared = $subsp->[0];
        foreach my $sign_id (0..$#{$spectrum}) {
            my $diff = abs($spectrum->[$sign_id] - $compared);
            next if $diff > $limit;
            next if defined($alpha) && ($diff + $beta) >= $alpha;
            $alphabeta->($diff + $beta, 
                [@{$spectrum}[0..$sign_id-1, $sign_id+1..$#{$spectrum}
+]],
                [@{$subsp}[1..$#{$subsp}]],
                [@$found, $spectrum->[$sign_id]],
            );
        }
        return;
    };
    $alphabeta->(0, \@spectrum, \@request, []);
    return 0, [] unless defined $alpha;

    my $max_similarity = sprintf $fmt, $alpha ? @request/$alpha : 1000
+;
    return $max_similarity, $best_subsp;    
}

my $limit = 3;

my @spectrum = qw/174.0 102.99 71.62 70.53 82.49 69.05 18.94 102.99 71
+.62 70.53 82.49 69.05 18.94 102.99 71.62 70.53 82.49 69.05 18.94 102.
+99 71.62 70.53 82.49 69.05 18.94/;
my @subsp = qw/102.0 18.94 71.61 82.0 173.0 68.7/;

my ($sim, $best) = similarity(\@subsp, \@spectrum, 2, $limit);

@subsp = sort {$a <=> $b} @subsp;

print <<DEBUG;
REQUEST     = @subsp
BEST_SUBSP  = @{$best}
SIMILARITY  = $sim
DEBUG
[download]

The question is: does anybody know a better (eq 'faster') solution for this case?

s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print

Comment on Calculate the similarity of two arrays of numbers Select or Download Code

Replies are listed 'Best First'.
Re: Calculate the similarity of two arrays of numbers by GrandFather (Saint) on Jul 20, 2006 at 23:07 UTC
The following code may help. It uses a binary search to find the closest value in the spectrum data to each value in the subspectrum data. Because this requires a sort and elimination of duplicates pass there would be some virtue in factoring that code out so it is done once may be of advantage. The code below uses the sub gf2 to preprocess @spectrum for gf1. sub gf2 { my ($spectrum, $subsp) = @_; my %unique; @unique{@spectrum} = (); my @uspectrum = sort {$a <=> $b} keys %unique; return gf1(\@uspectrum, $subsp); } sub gf1 { my ($spectrum, $subsp) = @_; my $sum = 0; for my $value (@$subsp) { my $mid = int (@$spectrum / 2); my $span = int (@$spectrum / 2); while (int $span) { if ($value < $spectrum->[$mid]) { $mid -= int (($span /= 2) + 0.5); } elsif ($value > $spectrum->[$mid]) { $mid += int (0.5 + ($span /= 2)); } else { last; } } my $delta; my $left = $mid ? abs ($value - $spectrum->[$mid - 1]) : 1e6; my $right = $mid < $#$spectrum ? abs ($value - $spectrum->[$mi +d + 1]) : 1e6; my $cent = abs ($value - $spectrum->[$mid]); $sum += (sort {$a <=> $b} ($left, $right, $cent))[0]; } my $result = 1000; $result = @$subsp / $sum if $sum != 0; return $result; } [download] Benchmark results are: `gf1 result: 2.11267605633804 gf2 result: 2.11267605633804 op result: 2.11267605633804 Rate op gf2 gf1 op 7.57/s -- -100% -100% gf2 7732/s 102012% -- -10% gf1 8582/s 113238% 11% --` [download] Read more... Full benchmark code (4 kB) DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Calculate the similarity of two arrays of numbers by rodion (Chaplain) on Jul 20, 2006 at 21:04 UTC
Since you've already looked at this problem a fair amount and done some optimization, further improvements are likely to depend on the partiulars. Thus there are naturally some quick questions right off. Mine have to do with your saying it needs to be really fast. I'm assuming this has to be fast because it's run again and again, on successive data sets. Thus compile time is irrelevent to performance. Does the spectrum change from call to call, or can one prep the spectrum once and run a bunch of subspectra against it. If so, how many subspectra for each spectrum. What's the dynamic range of the data. What's the data's precision. Are these things that can be determened once, entered once in advance of each run, or maybe they can't be determined at all in advance. (Solving things in the more general case can be way more expensive that solving a more particular problem, which is why engineering thinking is so different from mathematics. The mathematics ignores the boundary conditions, engineering often starts with them.)	[reply]
Re^2: Calculate the similarity of two arrays of numbers by Ieronim (Friar) on Jul 20, 2006 at 22:02 UTC
At this moment the function is used to query the database of spectra to find the big spectra 'similiar' to given small subspectrum. So i match nearly thousand of spectra (at the moment, the database is continiously increasing) against one subspectrum. I did not think about any optimizations when i wrote the first version of the function (not shown here). But after a while i realized that most requests cause the server to hang and understood that the situation needs some improvement :) The function i demonstrated works about 400 times faster than the old one :) The dynamic range and precision of the data are known. All signals in all spectra lie between 200 and 0. The precision is always the same, e.g. "%.2f". But if you have any suggestions about how the function can be improved if i need to match e.g. 10_000 subspectra against one spectum, i'd be very glad to read them. `s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print`	[reply] [d/l]
Re^3: Calculate the similarity of two arrays of numbers by rodion (Chaplain) on Jul 21, 2006 at 05:47 UTC
Thanks for the additional info. It's very helpful. I don't have time to do a full code implementation, at least not right now, but I may be able to give you enough to get you going on an implementation by sharing my thoughts so far. Since you'll be getting more subspectra that you want to check, you've got a many sets to many sets matching task. You need to decide two things. Which side to optimize. How to optimize it. If the two sides were both stored in memory, you would simply want to optimize the side with more points (subspectrum_avg_size * number of subspectra compared with the same product for the spectra). However, the spectra are stored in the database so that you can compare them with subspectra that arrive over time. If you optimize the spectra, you need to be able to store the optimization in the database too, and retrieve it; this will have overhead which you need to figure in for making your choices of which side to optimize. The choice of optimization will affect this overhead, so the two questions interact. Lets look at optimizing. You are optimizing the process of finding the nearest point y from a set, y_set, for a given point x taken from the other set, the x_set, which you go through one at a time. The general categories of optimization are Unoptimized -- The y-set is un-ordered and you have to go through each of the y's for each x. It's simple, but slow, and you've established that it's too slow. Sorted Array -- You can use binary search to home in. Grandfather has already given you code for this. (Be careful implementing and testing your adaptation of it if you use it. Binary search is notoriously vulnerable to bugs around the boundary case. Small changes matter.) Hashed access -- Here you use a lookup function to get a small set of points that might be the closest, and then find the right one by checking each of these if there is more than one. That all sounds great for choice 3, and Perl is excelent at storing small arrays as the value of a hash, but it gets trickier because hashes are exact-match tools. How do you turn "being close" into an exact match? You can do it if you pick several fixed distance thresholds, and set up separate hashes for each. Let's assume that you can abandon a match if x and y are more than 5.00 appart. Set up hashes my %dist_0_00; # for exact matches my %dist_0_05; # for points that may match within 0.05 my %dist_0_50; # for points that may match within 0.50 my %dist_5_00; # for points that may match within 5.00 my ($y005,$y050,$y500); for $y (@y_set) { $y005 = floor($y10)/10; $y050 = floor($y); $y500 = floor($y/10)10; $dist_0_00{$y}=[$y]; push @{$dist_0_05{$y005}}, $y; push @{$dist_0_05{$y005+0.1}}, $y; push @{$dist_0_50{$y050}}, $y; push @{$dist_0_50{$y050+1.0}}, $y; push @{$dist_5_00{$y500}}, $y; push @{$dist_5_00{$y500+10.0}},$y; } # so if $x = 18.94, all the matches that are within 0.05 # of it are in either in the array at arrayref # $dist_0_05{18.8} or in the one at $dist_0_05{18.9} # similarly, the matches within 0.5 are in two hash elements # of %dist_0_50. my @matches; my $x_bin; for $x ($x_set) { if ( $dist_0_00{$x} ) { # exact match push @match,[$x,$x]; next; } $x_bin = floor(($x-0.05)*10).10.0; if ( @y_test_low = $dist_0_05{$xbin} \|\| @y_test_high= $dist_0_05{$xbin+0.1} ) { if (defined($y=have_best_within(0.05,@y_test_low,@y_test_high)){ push @match,[$x,$]; next; } } # similarly for 0.5 and 5.0 thresholds } # for $x [download] This matching will perform well in cases where the match is between two points that are close. The sub have_best_within() takes up a longer time when there is no close match, but there are lots of points that are not outside two times the outer match threshold. If that case comes up often enough to worry about, then you can use binary search for that (or those) cases. For that you will also have to sort the arrays within each hash value, but just the ones at the coarser threshold, where you're using binary seach. Note that binary search looses you time if there aren't many elements to look through, say less than 10 or 20. In that case linear search can be faster (There aren't very many arrays at the coarser thresholds, so it doesn't take so long to sort them all.) If you decide to optimize search on the database side, the the hashes could become tied hashes, indexed not only by the floor()ed $y values, but by a sectrum id as well. The performance of that is another wrinkle to be thought through. Good luck with your project. Feel free to message me on things that were unclear.	[reply] [d/l]