I don't have time to do a full code implementation, at least not right now, but I may be able to give you enough to get you going on an implementation by sharing my thoughts so far.
Since you'll be getting more subspectra that you want to check, you've got a many sets to many sets matching task. You need to decide two things.
Let's look at optimizing. You are optimizing the process of finding the nearest point y in one set, the y_set, for a given point x taken from the other set, the x_set, which you go through one at a time. The general categories of optimization are
Let's assume that you can abandon a match if x and y are more than 5.00 apart. Set up hashes like this:

    use POSIX qw(floor);

    # @x_set and @y_set are assumed to hold the two sets of points

    my %dist_0_00; # for exact matches
    my %dist_0_05; # for points that may match within 0.05
    my %dist_0_50; # for points that may match within 0.50
    my %dist_5_00; # for points that may match within 5.00

    my ($y005,$y050,$y500);
    for my $y (@y_set) {
        $y005 = floor($y*10)/10;
        $y050 = floor($y);
        $y500 = floor($y/10)*10;
        $dist_0_00{$y} = [$y];
        push @{$dist_0_05{$y005}},      $y;
        push @{$dist_0_05{$y005+0.1}},  $y;
        push @{$dist_0_50{$y050}},      $y;
        push @{$dist_0_50{$y050+1.0}},  $y;
        push @{$dist_5_00{$y500}},      $y;
        push @{$dist_5_00{$y500+10.0}}, $y;
    }

    # so if $x = 18.94, all the matches that are within 0.05
    # of it are either in the array at arrayref
    # $dist_0_05{18.8} or in the one at $dist_0_05{18.9}
    # similarly, the matches within 0.5 are in two hash elements
    # of %dist_0_50.

    my @matches;
    my $x_bin;
    for my $x (@x_set) {
        if ( $dist_0_00{$x} ) { # exact match
            push @matches, [$x,$x];
            next;
        }
        $x_bin = floor(($x-0.05)*10)/10;
        # collect the candidates from both buckets (either may be missing)
        my @y_test = ( @{ $dist_0_05{$x_bin}     || [] },
                       @{ $dist_0_05{$x_bin+0.1} || [] } );
        if ( @y_test ) {
            # have_best_within() is not shown; $x is passed in explicitly
            my $y = have_best_within($x, 0.05, @y_test);
            if ( defined $y ) {
                push @matches, [$x,$y];
                next;
            }
        }
        # similarly for 0.5 and 5.0 thresholds
    } # for $x

This matching will perform well in cases where the match is between two points that are close. The sub have_best_within() takes longer when there is no close match but there are lots of points within twice the outer match threshold. If that case comes up often enough to worry about, then you can use binary search for that case (or those cases). For that you will also have to sort the arrays within each hash value, but just the ones at the coarser threshold, where you're using binary search. (There aren't very many arrays at the coarser thresholds, so it doesn't take long to sort them all.) Note that binary search loses you time if there aren't many elements to look through, say fewer than 10 or 20. In that case linear search can be faster.
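Here is a rough sketch of what the two search routines might look like. It is only a guess at the details: the argument order of have_best_within() (the point, then the threshold, then the candidates) is an assumption, and best_within_sorted() is a hypothetical name for the binary-search variant, which expects a bucket already sorted in ascending numeric order.

```perl
use strict;
use warnings;

# Linear scan: return the candidate closest to $x if it lies within
# $threshold, otherwise undef. The argument order is an assumption.
sub have_best_within {
    my ($x, $threshold, @candidates) = @_;
    my ($best, $best_dist);
    for my $y (@candidates) {
        my $dist = abs($x - $y);
        next if $dist > $threshold;
        if (!defined $best_dist || $dist < $best_dist) {
            ($best, $best_dist) = ($y, $dist);
        }
    }
    return $best;
}

# Binary search over a bucket sorted in ascending numeric order.
# Hypothetical helper with the same return convention as above.
sub best_within_sorted {
    my ($x, $threshold, $sorted) = @_;    # $sorted is an array ref
    return unless @$sorted;
    my ($lo, $hi) = (0, $#$sorted);
    # Narrow down to the first index whose value is >= $x
    # (or to the last index if every value is smaller).
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($sorted->[$mid] < $x) { $lo = $mid + 1 } else { $hi = $mid }
    }
    # The nearest value sits at $lo or just before it.
    my $best;
    for my $i ($lo - 1, $lo) {
        next if $i < 0;
        my $y = $sorted->[$i];
        $best = $y if !defined $best || abs($x - $y) < abs($x - $best);
    }
    return abs($x - $best) <= $threshold ? $best : undef;
}
```

For the coarser thresholds you would sort each bucket once up front, for example with @$_ = sort { $a <=> $b } @$_ for values %dist_5_00; and then call best_within_sorted($x, 5.00, $dist_5_00{$x_bin}) instead of the linear version.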
If you decide to optimize search on the database side, then the hashes could become tied hashes, indexed not only by the floor()ed $y values, but by a spectrum id as well. The performance of that is another wrinkle to be thought through.
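To make the "indexed by spectrum id as well" idea a bit more concrete, here is a minimal in-memory sketch for the 0.50 threshold only. The %spectra data, the "id:bin" key format and the candidates_0_50() helper are all made up for illustration; nothing here is tied to a database yet.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical input: spectrum id => list of peak positions.
my %spectra = (
    spec_a => [ 18.94, 42.10 ],
    spec_b => [ 18.60, 77.30 ],
);

# One flat hash for the 0.50 threshold, keyed by "id:bin"
# instead of by the bin alone.
my %dist_0_50;
for my $id (keys %spectra) {
    for my $y (@{ $spectra{$id} }) {
        my $bin = floor($y);
        push @{ $dist_0_50{ join(':', $id, $bin) } },     $y;
        push @{ $dist_0_50{ join(':', $id, $bin + 1) } }, $y;
    }
}

# Candidates that may lie within 0.5 of $x in one spectrum. The bucket
# choice mirrors the single-spectrum lookup: floor($x - 0.5) and the
# bucket above it. A point can show up twice, which does no harm when
# you only want the nearest one.
sub candidates_0_50 {
    my ($id, $x) = @_;
    my $bin = floor($x - 0.5);
    return map { @{ $dist_0_50{ join(':', $id, $_) } || [] } }
           ($bin, $bin + 1);
}
```

A flat string key like this is easier to move into a tied hash later than a nested hash would be. Keep in mind that a DBM-backed tie such as DB_File stores plain strings, so the array-ref values would need to be serialized in some way (MLDBM is one option).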
Good luck with your project. Feel free to message me on things that were unclear.
In reply to Re^3: Calculate the similarity of two arrays of numbers
by rodion
in thread Calculate the similarity of two arrays of numbers
by Ieronim