Zipcode Proximity script

hacker has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Zipcode Proximity script by blokhead (Monsignor) on Mar 31, 2003 at 16:26 UTC
If your search radius is so small (25 miles), You could eliminate 90% of your data before you do a lot of the number crunching using a less-accurate estimate first -- perhaps a good choice would be to check if the latitude/longitude are within ~10 degrees (or some constant appropriate for your data) of the origin and skip the coordinates if they are not. Maybe just comparing the number of degrees is a bad choice, but I also know that there are several great circle distance calculation algorithms around, with varying trade-offs between speed and accuracy. The key is to eliminate bad matches first with the low-accuracy algorithm, then use the high-accuracy one on the close ones. You may also wish to check out Geo::Distance or Geo::PostalCode to see how they do it. Geo::PostalCode claims to have a feature exactly as you describe. blokhead	[reply]
Re: Re: Zipcode Proximity script by paulbort (Hermit) on Mar 31, 2003 at 19:35 UTC
To add to this: If you don't really care about great circle distance, and you can get away with flat-earth distances, you could start by eliminating all points that are outside the smallest box that fits around your circle. This will eliminate most of the data in two passes. (Outside Lat bounds, and outside Long bounds.) If you store the master data in a SQL database with a couple of indexes, you could just do a query to get the starting set, then trim it with the circle algorighm you already have. The advantage this has over blokhead's method is that it will work correctly regardless of circle size, and eliminate more points, in exchange for four extra calculations (N, S, E, W bounds). (Ten degrees is a lot of land.) You might also want to consider storing the points in a database so that a SQL Query on a couple of indexes can do most of the data elimination for you in one pass. Also, if you multiply the Lat/Longs by 10,000, you can store and manipulate them as longs instead of singles. (This alone might be enough to solve your performance problem.) Just a quick stab at some SQL for this: (It assumes you've already calculated the bounding box.) `SELECT zip FROM zip_lat_long WHERE lat > $south_border AND lat < $north_border AND long > $east_border AND long < $west_border AND $max_distance 2 > ( ABS( lat - $center_lat ) 2 + ABS( long - $center_long ) ** 2 );` [download] (With luck an optimizer will index seek the first four conditions before crunching the last one.) This is a neat problem, I wish I had time to bang out some sample code. -- Spring: Forces, Coiled Again!	[reply] [d/l]
Re: Zipcode Proximity script by dga (Hermit) on Mar 31, 2003 at 16:38 UTC
I would pursue data reduction. One possibiliy of data reduction is that at 60 degrees north any place within 1 degree longitude of another place is less than about 30 miles away and latitude is about 60 miles per degree so a real fast first pass might just be to take for your example a subset of the places where `lat > 42.5 && lat < 43.5 and lon > 70 && lon < 72`. These would of course be calculated from each desired source place and give a box of about 30 x 40 miles around your source. Then do the real distanace calculation to get the actual values which will elimate a few zips from the small set, all of which can probably be in local storage in the script for speed. Another, very accurate, way to set the bounding box would be to go 25 miles straight north for the start zip and then calculate the longitude difference needed to go 25 miles then set the box at + and - the lat and + and - the lon you calculated from the center and you have an about 25x25 mile box then use the real formula to eliminate the corner zips. If you work in lat or lon order in from the corners toward the center, you can determine as you go that all closer lat/lon (depending on your order) are inside the circle and do not need to calculate them. This way requires the real distance calculation to be done once to make up the bounding box, then simple subtractions for all other calculations in pass 1, then run the real dist calc in pass 2 for the corner cases until you are close enough in not to need them and the rest get included for zero added cost. Have fun	[reply] [d/l]
Re: Zipcode Proximity script by tall_man (Parson) on Mar 31, 2003 at 16:18 UTC
If your C code is slow, a perl implementation will be even slower. What you need is a better data structure so you don't have to do a linear search over all codes, for example a quad tree. The module Tree::M might help. Update: I notice that your perl code uses a flat-geometric distance rather than a curved-earth great circle distance. That's probably good enough for typical geometric queries (like "50 miles from my house"). But you could do the same thing in the C program. You could also search using the distance squared and avoid all those square roots. `my $xdelta = $x1-$x2; my $ydelta = $y1-$y2; my $distance_squared = $xdelta$xdelta + $ydelta$ydelta;` [download]	[reply] [d/l]
Re: Zipcode Proximity script by hacker (Priest) on Mar 31, 2003 at 16:25 UTC
It gets better than that. The reason I need them in CSV format in two columns of origin,destination like that, is because I have to import them into... wait for it... a Microsoft Access .mdb file, so an ASP page can query it. The import only happens rarely, but often enough to want to optimize the tool that is doing the conversion. I'm slowly converting the system (carbon-based and silicon-based) to perl/mysql/php, depending on the needs, but for now, it has to remain in ADO/.mdb format during the transition. I'll take a peek at Tree::M though and see what that can do for me. Thanks.	[reply]
�Re: Zipcode Proximity script by merlyn (Sage) on Mar 31, 2003 at 19:53 UTC
PostgreSQL has built-in "geometric" types which can be used to perform functions like distance. You could select all points within a certain radius, and then sort the points to determine the closest, all at C speeds. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply]
Re^2: Zipcode Proximity script (what is "C"?) by Aristotle (Chancellor) on Apr 02, 2003 at 23:07 UTC
all at C speeds As in the language, or as in light? `:)` Makeshifts last the longest.	[reply]
Re: Zipcode Proximity script by CountZero (Bishop) on Mar 31, 2003 at 19:50 UTC
A simple trick to cut time in half is to check only for places which are further down the file, as the distance from A to B is the same as the distance from B to A. After you have found all A to B distances, just invert the list to get all B to A distances as well. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law	[reply]
Re: Zipcode Proximity script by Anonymous Monk on Mar 31, 2003 at 19:58 UTC
First of all, what everyone else has said about first trimming your data set applies: for a given longitude and latitude, you should be able to come up with boundaries that are definitely in your set and boundaries that are definitely out. That being said, you can speed up your comparison even in the C code by doing this: I assume that in your main loop you do something like: `while (read_from_file(&newzip, &newlat, &newlon)) { mydist = great_circle_distance(newlat,newlon,origlat,origlon); if (mydist < 25) { // do stuff to add it to file } if (mydist < 50) { ... } ... }` [download] Instead, replace it with this: `float dist1cosine = cos(25.0 / EARTH_RADIUS); float dist2cosine = cos(50.0 / EARTH_RADIUS); float dist3cosine = cos(75.0 / EARTH_RADIUS); while (read_from_file(&newzip, &newlat, &newlon)) { mydistcos = great_circle_distance_cosine(newlat,newlon, origlat,origlon); if (mydistcos >= dist1cosine) { // do stuff to add it to file } if (mydistcos >= dist2cosine) { ... } ... }` [download] where you've got: `static inline float great_circle_distance_cosine(float lat1, float long1, float lat2, float long2) { float delta_long, temp1, temp2, delta_lat; delta_lat = lat2 - lat1; delta_long = long2 - long1; temp1 = sin(lat1) * sin(lat2); temp2 = cos(delta_lat); temp2 -= temp1; temp2 = cos(delta_long); / result is cos(lat1) cos(lat2) [ cos(lon1 - lon2) ] + sin(lat1) sin(lat2) */ return (temp1 + temp2); }` [download] The point of this exercise is to avoid the atan2 and sqrt calls, which are done to take the arccos of sqrt(temp) in the original code. Over the range you're dealing with, arccos is strictly decreasing, (hence the switch from "<" to ">=" in the comparison) which means you don't need to take arccos every time in order to Actually, I think that you could translate this algorithm into perl and, even without applying any pre-distance filtering, get better results than your current program. Edit by tye, remove PRE tags surrounding CODE tags	[reply] [d/l] [select]
Re: Re: Zipcode Proximity script by fizbin (Chaplain) on Apr 01, 2003 at 04:33 UTC
Here's a way to do all that prefiltering people keep mentioning: First, some helper procedures: use constant EARTH_RADIUS => 3956; use constant PI => 4 * atan2 1, 1; sub make_quick_reject { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($distcos) = cos($dist / EARTH_RADIUS); my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos); if (abs($lat) + $maxlatdiff > PI/2) { # special case - near poles return sub { abs($lat - $_[0]) > $maxlatdiff; }; } my($longmult); $longmult = 1 / cos(abs($lat) + $maxlatdiff); my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) > $maxlatdiff or abs($long - $_[1]) > + $maxlongdiff; } } sub make_quick_accept { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($distcos) = cos($dist / EARTH_RADIUS); my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos) / + sqrt(2); my($longmult); if (abs($lat) < $maxlatdiff ) { # special case - near equator $longmult = 1; } else { $longmult = 1 / cos(abs($lat) - $maxlatdiff); } my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) < $maxlatdiff and abs($long - $_[1]) < $maxlongdiff; } } [download] Then, I assume that you've loaded your data into something like this: `# zipcode, lat, long @data = (['00210', 43.00589, 71.0132], ['00211', 43.00589, 71.0132], . +..);` [download] And then done something like this to it: `map { $_->[1] = PI/180; $_->[2] = PI / 180;} @data;` [download] Then you can do something like this: `$distance = 25; for $i (0..$#data) { my $quick_reject_routine = make_quick_reject($data[$i]->[1], $data[$i]->[2], $distance); my $quick_accept_routine = make_quick_accept($data[$i]->[1], $data[$i]->[2], $distance); for $record (grep(! $quick_reject_routine->($_->[1],$_->[2]), @data[$i+1..$#data])) { if ($quick_accept_routine->($record->[1],$record->[2]) or long_acceptance_routine($record, $data[$i], $distance) { # write it to some file or other } } }` [download] Where for "long_calculation_routine", you substitute in essentially what you're doing now. (or better yet, something analogous to the simpler C code suggested in the post I made before logging in) Extending this to the case of checking several distances at once shouldn't be hard - just construct an array of quick reject and quick accept procedures, and then check through them in order before trying the long acceptance routine. Something like this: @distances = qw(5 10 25 50 100); @distfiles = map { "0-$_.txt" } @distances; for $i (0..$#data) { my @quick_reject_routines = map { make_quick_reject($data[$i]->[1], $data[$i]->[2], $_); } @distances; my @quick_accept_routines = map { make_quick_accept($data[$i]->[1], $data[$i]->[2], $_); } @distances; OTHERCITY: for $j ($i..$#data)) { DISTANCE: for $d (0..$#distances) { if ($quick_reject_routines[$d]->($data[$j][1],$data[$j][2])) { next DISTANCE; } if ($quick_accept_routines[$d]->($data[$j][1],$data[$j][2]) or long_acceptance_routine($data[$j], $data[$i], $distance) { map { write_csv_line($data[$i], $data[$j], $_); write_csv_line($data[$j], $data[$i], $_); } @distfiles[$d..$#distances]; next OTHERCITY; } } } } [download] Of course, this still contains extraneous calls to long_acceptance_routine, but my hope is that the routines long_acceptance_routine depends on will somehow be caching recent results (see the Memoize modules) and so there won't be the performance penalty.	[reply] [d/l] [select]
Re: Zipcode Proximity script by fizbin (Chaplain) on Apr 01, 2003 at 15:32 UTC
Whoops. This is what I get for coding after my brain has already shut down for nightly maintenance. Those quick_accept and quick_reject routines will give false rejections with longitudes near the 180 degree longitude. (not an issue with US locations, I know, but...) Also, there's a stupid bit of time-wasting going on near the top there. Let's see if I can get it right this time: sub make_quick_reject { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($maxlatdiff) = $dist / EARTH_RADIUS; if (abs($lat) + $maxlatdiff > PI/2) { # special case - near poles return sub { abs($lat - $_[0]) > $maxlatdiff; }; } my($longmult); $longmult = 1 / cos(abs($lat) + $maxlatdiff); my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) > $maxlatdiff or ( abs($long - $_[1]) > $maxlongdiff and abs($long - $_[1]) < 2PI - $maxlongdiff ); }; } sub make_quick_accept { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($maxlatdiff) = $dist / EARTH_RADIUS; $maxlatdiff /= sqrt(2); my($longmult); if (abs($lat) < $maxlatdiff ) { # special case - near equator $longmult = 1; } else { $longmult = 1 / cos(abs($lat) - $maxlatdiff); } my($maxlongdiff) = $maxlatdiff $longmult; return sub { abs($lat - $_[0]) < $maxlatdiff and ( abs($long - $_[1]) < $maxlongdiff or abs($long - $_[1]) > 2*PI - $maxlongdiff ); }; } [download] And just again as a reminder, the sub's produced by these two routines accept longitudes and latitudes in RADIANS, not degrees. (Though reworking them to use degrees instead of radians should not be difficult at all) And if anyone uses this code without understanding how it works, they deserve what they get. No warranty of anything useful implied at all; the code may not even pass perl -c.	[reply] [d/l]
Re^3: Zipcode Proximity script by Anonymous Monk on Apr 18, 2010 at 06:33 UTC
instead of `my $maxlatdiff = atan2(sqrt(1 - ($distcos * $distcos)), $distcos);` you could do `use Math::Trig; my $maxlatdiff = acos($distcos);` [download]	[reply] [d/l] [select]
Re: Zipcode Proximity script by shotgunefx (Parson) on Mar 31, 2003 at 22:22 UTC
I had a similar problem and did a quick and dirty reduction by having a map for each state that contained all bordering states which is a quick and easy way to reduce the data. -Lee "To be civilized is to deny one's nature."	[reply]
Re: Re: Zipcode Proximity script by paulbort (Hermit) on Apr 01, 2003 at 15:52 UTC
Won't work. Check out how close 44030 (Conneaut, OH) is to 14775 (Ripley, NY). But there's a whole state in the way! (PA, if you're curious.) -- Spring: Forces, Coiled Again!	[reply]
Re: Re: Re: Zipcode Proximity script by shotgunefx (Parson) on Apr 01, 2003 at 16:29 UTC
It's simple, add the non-bordering states that have borders with N miles to that's states border-states. -Lee "To be civilized is to deny one's nature."	[reply]
Re: Re: Re: Re: Zipcode Proximity script by paulbort (Hermit) on Apr 03, 2003 at 15:58 UTC
Re: Re: Re: Re: Re: Zipcode Proximity script by shotgunefx (Parson) on Apr 03, 2003 at 18:07 UTC
Re: Zipcode Proximity script by Anonymous Monk on Mar 31, 2003 at 21:14 UTC
We had a job like this come through as a proposal a couple months ago. Guy had a salesman in zip code A, B, and C. Had a customer in zip code D,E, and F. Wanted to know which salesmen was closest. The results would be stored on his server. He was basically expecting a hash of each Zip Code centroid in the US measured to every other zip code centroid. Hate to bust the guy's bubble but it would've taked years to do all the calculations, assuming really fast calculations. So.... Maybe you could eliminate all the zips that aren't + or - 1 utm zone from the one you're measuring from. That might eliminate quite a few. Maybe speed it up a bit. Neil	[reply]
Re: Zipcode Proximity script by Anonymous Monk on Nov 13, 2007 at 01:14 UTC
check this out. This solution contains a Stored Procedure that accepts a zip code and mile radius and returns all zip codes within that radius. (Example: Give me all the zip codes within a 10-mile radius of 50325). The Solution comes with and a dataload script that will create a SQL Server database, the Stored Procedure containing the algorithm, and load over 42,00 zip codes. http://www.spikesolutions.net/ViewSolution.aspx?ID=1e7d197c-7441-40f4-8e87-5ecc3ef5ab60	[reply]