Thanks for the additional info. It's very helpful.

I don't have time to do a full code implementation, at least not right now, but I may be able to give you enough to get you going on an implementation by sharing my thoughts so far.

Since you'll be getting more subspectra that you want to check, you've got a many sets to many sets matching task. You need to decide two things.

If the two sides were both stored in memory, you would simply want to optimize the side with more points (subspectrum_avg_size * number of subspectra compared with the same product for the spectra). However, the spectra are stored in the database so that you can compare them with subspectra that arrive over time. If you optimize the spectra, you need to be able to store the optimization in the database too, and retrieve it; this will have overhead which you need to figure in for making your choices of which side to optimize. The choice of optimization will affect this overhead, so the two questions interact.

Lets look at optimizing. You are optimizing the process of finding the nearest point y from a set, y_set, for a given point x taken from the other set, the x_set, which you go through one at a time. The general categories of optimization are

  1. Unoptimized -- The y-set is un-ordered and you have to go through each of the y's for each x. It's simple, but slow, and you've established that it's too slow.
  2. Sorted Array -- You can use binary search to home in. Grandfather has already given you code for this. (Be careful implementing and testing your adaptation of it if you use it. Binary search is notoriously vulnerable to bugs around the boundary case. Small changes matter.)
  3. Hashed access -- Here you use a lookup function to get a small set of points that might be the closest, and then find the right one by checking each of these if there is more than one.
That all sounds great for choice 3, and Perl is excelent at storing small arrays as the value of a hash, but it gets trickier because hashes are exact-match tools. How do you turn "being close" into an exact match? You can do it if you pick several fixed distance thresholds, and set up separate hashes for each.

Let's assume that you can abandon a match if x and y are more than 5.00 appart. Set up hashes

my %dist_0_00; # for exact matches my %dist_0_05; # for points that may match within 0.05 my %dist_0_50; # for points that may match within 0.50 my %dist_5_00; # for points that may match within 5.00 my ($y005,$y050,$y500); for $y (@y_set) { $y005 = floor($y*10)/10; $y050 = floor($y); $y500 = floor($y/10)*10; $dist_0_00{$y}=[$y]; push @{$dist_0_05{$y005}}, $y; push @{$dist_0_05{$y005+0.1}}, $y; push @{$dist_0_50{$y050}}, $y; push @{$dist_0_50{$y050+1.0}}, $y; push @{$dist_5_00{$y500}}, $y; push @{$dist_5_00{$y500+10.0}},$y; } # so if $x = 18.94, all the matches that are within 0.05 # of it are in either in the array at arrayref # $dist_0_05{18.8} or in the one at $dist_0_05{18.9} # similarly, the matches within 0.5 are in two hash elements # of %dist_0_50. my @matches; my $x_bin; for $x ($x_set) { if ( $dist_0_00{$x} ) { # exact match push @match,[$x,$x]; next; } $x_bin = floor(($x-0.05)*10).10.0; if ( @y_test_low = $dist_0_05{$xbin} || @y_test_high= $dist_0_05{$xbin+0.1} ) { if (defined($y=have_best_within(0.05,@y_test_low,@y_test_high)){ push @match,[$x,$]; next; } } # similarly for 0.5 and 5.0 thresholds } # for $x
This matching will perform well in cases where the match is between two points that are close. The sub have_best_within() takes up a longer time when there is no close match, but there are lots of points that are not outside two times the outer match threshold. If that case comes up often enough to worry about, then you can use binary search for that (or those) cases. For that you will also have to sort the arrays within each hash value, but just the ones at the coarser threshold, where you're using binary seach. Note that binary search looses you time if there aren't many elements to look through, say less than 10 or 20. In that case linear search can be faster (There aren't very many arrays at the coarser thresholds, so it doesn't take so long to sort them all.)

If you decide to optimize search on the database side, the the hashes could become tied hashes, indexed not only by the floor()ed $y values, but by a sectrum id as well. The performance of that is another wrinkle to be thought through.

Good luck with your project. Feel free to message me on things that were unclear.


In reply to Re^3: Calculate the similarity of two arrays of numbers by rodion
in thread Calculate the similarity of two arrays of numbers by Ieronim

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.