fastest way to compare numerical strings?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: fastest way to compare numerical strings? by psini (Deacon) on Jun 30, 2008 at 18:10 UTC
An obvious improvement could be to sort your array first. So you can check each value on its neighbors only reducing the tests from NN to NM where M is the average population of clusters more 1. Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."	[reply]
Re: fastest way to compare numerical strings? by pc88mxer (Vicar) on Jun 30, 2008 at 18:27 UTC
It's clear that you want to find nearby points in a 2-dimensional space. One effective way is to overlay a grid which groups the data into cells. Then you only have to compare points from adjacent cells. Some pseudo-code: `my %CellMembers; for my $i (0..$npoints) { my $cellid = int($lat[$i]).",".int($long[$i]); push(@{$CellMembers{$cellid}}, $i); } # iterate over cells for my $ci (keys %CellMembers) { for my $cj (...cells within distance 1 of $ci...) { for my $i (@{$CellMembers{$ci}}) { for my $j (@{$CellMembers{$cj}} { ...compare points $i and $j... } } } }` [download] Lots of nested loops, but probably a big time saver.	[reply] [d/l]
Re^2: fastest way to compare numerical strings? by salva (Canon) on Jul 01, 2008 at 07:11 UTC
that is probably the best solution, so let's see some real (untested) code! use POSIX qw(floor); my @lat = ...; my @long = ...; my $radius = 0.01; my $radius2 = $radius 2; my @flat = map { floor($_/$radius) } @lat; my @flong = map { floor($_/$radius) } @long; my %cell; for my $i (0..$#lat) { push @{$cell{$flat[$i], $flong[$i]}}, $i } for my $i (0..$#lat) { my @near; for my $dflat(-1, 0, 1) { my $flat = $flat[$i] + $dflat; for my $dflong(-1, 0, 1) { my $flong = $flong[$i] + $dflong; if (my $cell = $cell{$flat, $flong}) { for my $j (@$cell) { if ((($lat[$j] - $lat[$i]) 2 + ($long[$j] - $long[$i]) ** + 2) < $radius) { push @near, $j } } } } } @near = sort @near; print "points near $i: @near\n"; } [download]	[reply] [d/l]
Re: fastest way to compare numerical strings? by grep (Monsignor) on Jun 30, 2008 at 18:12 UTC
Just a quick note: The take the abs value of the difference between the 2 numbers, check if that is greater than .01. That should read cleaner and get rid of most of those checks. Also as a favor for the people trying to read your code - pick a style of indenting. `#untested for ($n=0;$n<=$#lat;$n++) { for ($v=0;$v<=$#lat;$v++) { if ( abs( $lat[$n] - $lat[$v] ) < .01 and abs( $long[$n] - $l +ong[$v] ) < .01 ) { ##OTHER CODE } } }` [download] Update: fixed typo and minor edit grep `One dead unjugged rabbit fish later...`	[reply] [d/l]
Re: fastest way to compare numerical strings? (O(N) ) by BrowserUk (Patriarch) on Jun 30, 2008 at 18:58 UTC
See Re: Speeding up ... (O(N) determination of point in polygon regardless of complexity) for a solution to this problem. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re: fastest way to compare numerical strings? by jethro (Monsignor) on Jun 30, 2008 at 18:19 UTC
Sort the array, then just compare consecutive values. Sorting speed is nlog(n) instead of n^2 b) If you don't mind some numbers reported twice and the values are all somewhat equally distributed over a fixed range you might seperate them into buckets and only compare all the numbers in each bucket. For example if all numbers are between 0.000000 to 9.9999999, you might generate 100 buckets, with the first bucket having all the numbers between 0.0 and 0.11, the second between 0.1 and 0.21 (the buckets have to overlap to catch the edge cases between two buckets). The speed is n+(n/100)^2100, which might in some cases be lower than the sort solution.	[reply]
Re: fastest way to compare numerical strings? by jds17 (Pilgrim) on Jun 30, 2008 at 19:14 UTC
In case not too many numbers cluster together around a few points, an alternative is to first form clusters of numbers where the numbers are at most 2 * $delta apart ($delta = 0.01 in your case) and then only process the candidate clusters to find numbers at most $delta apart. For the following code there is only one pass through the input array at the beginning and then sorts of small arrays ("small" in case the assumption made at the beginning holds). #maximum distance we are looking for $delta = 0.01; #test array @a = (1.02, 1.03, 6.01, 9, 1.04, 1.011, 1.025, 1.01, 0.005, -0.002); #"discretize" points to neighboring points, scale by 1/$delta #to simplify computation for (@a) { push @{$h{int($_/$delta)}}, $_; push @{$h{int($_/$delta-1)}}, $_; push @{$h{int($_/$delta+1)}}, $_; } #handle clusters for (keys %h) { #in case the corresponding array has more than one element, #we know that it contains at least one pair not #further apart than 2 * $delta, otherwise ignore it if (@{$h{$_}} > 1) { @sorted = sort @{$h{$_}}; for (0..@sorted-2) { $r = $sorted[$_]; $s = $sorted[$_+1]; #filter out neighboring pairs, since we do not need to #process the numbers further, we cram them into a #string for the final output $near{"$r, $s"} = 1 if ($r < $s && $s <= $r + $delta); } } } print "Not further than $delta apart are the following pairs:\n"; print "$_\n" for (keys %near); [download] Output: `Not further than 0.01 apart are the following pairs: -0.002, 0.005 1.02, 1.025 1.025, 1.03 1.011, 1.02 1.03, 1.04 1.01, 1.011` [download] Update: Tested using the Benchmark module and a fixed array of 300000 numbers randomly distributed between 0 and 300, the near number determining part took about 8 seconds on my reasonably modern machine. I.e. the benchmark test starts like this: `use strict; use warnings; use Benchmark; #maximum distance we are looking for my $delta = 0.01; #test array my @a; for (1..300000) { push @a, rand() * 300; } my ($r, $s); timethis ( 10 => sub { ...` [download] Update: Improved the description, it gave the impression we first look for all numbers in the array not more than 2 * $delta apart, which is not the case.	[reply] [d/l] [select]
Re: fastest way to compare numerical strings? by FunkyMonk (Bishop) on Jun 30, 2008 at 18:56 UTC
You may get a very substantial improvement by only checking the longitudes if the latitudes are close. You can do this easily by splitting your if into two. This runs in around 6 seconds here! `my @lat = map { rand( 180 ) - 90 } 1 .. 300_000; my @long = map { rand( 180 ) - 90 } 1 .. @lat; for my $lat ( 0 .. $#lat - 1 ) { if ( abs( $lat[$lat] - $lat[$lat+1] ) < 0.01 ) { for my $long ( 0 .. $#long - 1 ) { if ( abs( $long[$long] - $long[$long+1] ) < 0.01 ) { print "\$lat[$lat] = $lat[$lat] \$long[$long] = $long[ +$long]\n"; } } } }` [download] Depending the spread of your points, this may give a very worthwhile improvement. But, if your points are all fairly close anyway, it won't be worthwhile. Unless I state otherwise, all my code runs with strict and warnings	[reply] [d/l]
Re^2: fastest way to compare numerical strings? by pc88mxer (Vicar) on Jun 30, 2008 at 23:56 UTC
Unfortunately this does not solve exactly the same problem because you are only checking to see if `$lat[$i]` and `$lat[$i+1]` are close. The OP wants to find all `$i` and `$j` where `$lat[$i]` is close to `$lat[$j]`. You can get closer to what the OP wanted if you first sorted the points based on `$lat[...]`. However, you still would want to check if `$lat[$i+2], $lat[$i+3]`, etc. were close to `$lat[$i]`.	[reply] [d/l] [select]
Re: fastest way to compare numerical strings? by pc88mxer (Vicar) on Jun 30, 2008 at 23:21 UTC
Let me suggest the non-perl solution: have your database do it for you. The magic phrase you need to tell Google seems to be "geodistance" and it yields the following interesting results: Geodistance seearch with Mysql references to SphinxSearch references to postgis references to mysql spatial extensions	[reply]
Re^2: fastest way to compare numerical strings? by BrowserUk (Patriarch) on Jun 30, 2008 at 23:50 UTC
If the OP will supply the data, I pretty much guarentee that this can be solved directly in Perl more quickly that you can load the data into MySql and add a "geospatial index", let alone construct & run the query, and then retrieve the results. And with less lines of code. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^3: fastest way to compare numerical strings? by pc88mxer (Vicar) on Jul 01, 2008 at 00:17 UTC
True, but there's always the possibility that the data is already (or wants to be) in a database. I'm just throwing out other options.	[reply]
Re: fastest way to compare numerical strings? by apl (Monsignor) on Jun 30, 2008 at 18:20 UTC
Use your systems sort utility on the file first. The you can compare adjacent records, reducing your number of tests to N - 1 rather than N**2.	[reply]
Re: fastest way to compare numerical strings? by Anonymous Monk on Jul 01, 2008 at 13:44 UTC
Is there a reason why your inner loop starts $v at 0? Your loop seems to start by comparing 0 to 0, then to 1, then to 2...When $n increments you compare 1 to 0, 1 to 1, etc. But aren't all the comparisons for $v <= $n ones that you've already done? There doesn't seem to be an obvious value to comparing 0 to 1, and then later comparing 1 to 0. They are, after all, both members of the same list. Could you change the inner loop to start with $v = $n + 1 and reduce the number of comparisons significantly?	[reply]
Re: fastest way to compare numerical strings? by Anonymous Monk on Jul 02, 2008 at 01:15 UTC
Hi All, Original poster / Anonymous (non)Monk here. Thanks for these great comments. I write code as the snail crawls. I am attempting to grok your advice and have already blown portions of my mind. Thank you. I am trying things out now and will report any solutions as they arise. A couple notes: 1) Yes the PSQL/MySql solution would be good but for practical reasons are less desirable right now. This code snippet is part of a long chain, so getting into SQL might happen later but right now I need to complete the chain with the one hammer I somewhat know: Perl. 2. All lat/long values are relatively close together, within a few thousand square kilometers, just fyi... 3. Lets not have me share the data just yet, thanks for the offer though. Thanks again and I'll post the results ASAP. It may or may not have anything to do with what you mention above, but the above certainly jogged the noodle which is helping.	[reply]
Re^2: fastest way to compare numerical strings? by BrowserUk (Patriarch) on Jul 02, 2008 at 01:44 UTC
Lets not have me share the data just yet A representative sample (even if made up), or just the bounding box of the dataset and precision of the points would allow us to generate random data that roughly matched the problem space to test possibilties. Also, a brief description of the purpose of the code would help. For example, are you counting the points that lie within each cell of a 0.01 x 0.01 grid? Or are you trying to "cluster" the points regardless of any predefined gridding? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^3: fastest way to compare numerical strings? by Anonymous Monk on Jul 02, 2008 at 02:39 UTC
Okay, I am happy with speed now. Winning answers were a combination of: the system 'sort' step mentioned by jethro and apl and maybe others (and searching only 1000 adjacent lines). the 'abs' command of grep and maybe others. The rest of the stuff sounds fascinating and potenitally useful. Thanks, truthfully, to everyone. Soon I hope to do this on a global daily dataset where n=800million or so. If anyone wants to hammer me on better methods and such please feel free. Again, thanks to all.	[reply]
Re^3: fastest way to compare numerical strings? by Anonymous Monk on Jul 02, 2008 at 01:52 UTC
Okey doke, while the tests run I'll type. Basically, the code is going through points representing fire detections by a NASA satellite (MODIS). Those points that fall close together (within .01) AND occur on consecutive days and at appropriate times ($numbdate and $timestamp) are written to a text file for review in other software (Arcview GIS) as part of ongoing research. Latitude stats: Minimum: 10.318842 Maximum: 14.124424 Mean: 11.24719 Standard Deviation: 0.805771 Longitude stats: Minimum: 21.097507 Maximum: 24.912207 Mean: 22.474358 Standard Deviation: 0.974835 Thanks!	[reply]
Re^4: fastest way to compare numerical strings? (300_000 points in under 4 minutes!) by BrowserUk (Patriarch) on Jul 02, 2008 at 03:46 UTC
Re^5: fastest way to compare numerical strings? (300_000 points in under 4 minutes!) by salva (Canon) on Jul 04, 2008 at 08:01 UTC
Some notes below your chosen depth have not been shown here
Re: fastest way to compare numerical strings? by Sagacity (Monk) on Jul 04, 2008 at 16:58 UTC
Drawing from my experience when developing a Do Not Call List discriminator, I too had thousands of comparisons to make of sets of numbers, each list had thousands of ten digit numbers. In the beginning we were looking for a match, then took the steps of what to do with it. It was slow! And then... What I found that impacted the speed the most was adding the thought of elimination as the primary criteria in my comparisons. As you went through a process of comparisons, each step had a valid reason to either continue comparing, or most importantly to speed, stop comparing and either log the results, or dump it and grab the NEXT line/entry/whatever. So, the loop system will perform the best when you have inserted the reasons for LAST and NEXT. I acknowledge that your comparisons differ from what I was dealing with, but I am also sure that the results will improve if you can implement this thought process inot your code.	[reply]
Re: fastest way to compare numerical strings? by salva (Canon) on Jul 07, 2008 at 09:15 UTC
I have just upload Algorithm::ClusterPoints to CPAN. On my 3GHz PIV machine, it is able to find the clusters from a set of 300_000 points in around 25 seconds.	[reply]