Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Zipcode Proximity script

by hacker (Priest)
on Mar 31, 2003 at 16:03 UTC ( [id://246967] : perlquestion . print w/replies, xml ) Need Help??

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I've been working with some C code that crunches the 2000 US Census data into CSV files, based on the specified proximity to the origin zipcode. The problem is that the C code is horribly slow, and I can't seem to figure out why. It takes my PIII/1.3Ghz/512mb RAM machine about 20 minutes to crunch the 987k input data file for zipcodes matching within a 0-25 radius of the givin origin zipcode. That seems very slow.

The master 2000 Census data file contains records in this format:

ZIP_CODE ONGITUD ATITUD 00210 71.0132 43.00589 00211 71.0132 43.00589 00212 71.0132 43.00589 00213 71.0132 43.00589 00214 71.0132 43.00589 00215 71.0132 43.00589 ...

My output file, separate for each type of range (0-25.txt for zipcodes within 0-25 miles of the origin, 0-50.txt for zipcodes within 0-50 miles of the origin, etc.), contains entries such as:

00210,00210 00210,00211 00211,00210 00210,00212 00212,00210 00210,00213 ...

For each given zipcode found in the master file (where origin == 00210 in this case, to start with), I want to output a file that contains all matching zipcodes within the specified proximity to that zipcode. So in the example above, all of the zipcodes within 0-25 miles of 00210 would be output to 0-25.txt, a csv file containing the data shown above.

I have the working radii functions which do this, and does work (but is very slow), and looks like:

#define EARTH_RADIUS 3956 static inline float deg_to_rad(float deg) { return (deg * M_PI / 180.0); } /* Function to calculate Great Circle distance between two points. */ static inline float great_circle_distance(float lat1, float long1, float lat2, float long2) { float delta_long, delta_lat, temp, distance; /* Find the deltas */ delta_lat = lat2 - lat1; delta_long = long2 - long1; /* Find the GC distance */ temp = pow(sin(delta_lat / 2.0), 2) + cos(lat1) * cos(lat2) * pow(sin(delta_long / 2.0), 2); distance = EARTH_RADIUS * 2 * atan2(sqrt(temp), sqrt(1 - temp)); return (distance); }

In perl, this would be:

my $distance = sqrt(($x1-$x2)**2+($y1-$y2)**2);

My goal is to convert this over to perl, both so I can gain the speed and efficiency of perl (as well as make this portable to Windows systems, where the current C code doesn't quite run yet), as well as expand my knowledge of perl in general.

Has anyone done this? Any pointers that might be useful here?

Replies are listed 'Best First'.
Re: Zipcode Proximity script
by blokhead (Monsignor) on Mar 31, 2003 at 16:26 UTC
    If your search radius is so small (25 miles), You could eliminate 90% of your data before you do a lot of the number crunching using a less-accurate estimate first -- perhaps a good choice would be to check if the latitude/longitude are within ~10 degrees (or some constant appropriate for your data) of the origin and skip the coordinates if they are not. Maybe just comparing the number of degrees is a bad choice, but I also know that there are several great circle distance calculation algorithms around, with varying trade-offs between speed and accuracy. The key is to eliminate bad matches first with the low-accuracy algorithm, then use the high-accuracy one on the close ones.

    You may also wish to check out Geo::Distance or Geo::PostalCode to see how they do it. Geo::PostalCode claims to have a feature exactly as you describe.


      To add to this: If you don't really care about great circle distance, and you can get away with flat-earth distances, you could start by eliminating all points that are outside the smallest box that fits around your circle. This will eliminate most of the data in two passes. (Outside Lat bounds, and outside Long bounds.) If you store the master data in a SQL database with a couple of indexes, you could just do a query to get the starting set, then trim it with the circle algorighm you already have.

      The advantage this has over blokhead's method is that it will work correctly regardless of circle size, and eliminate more points, in exchange for four extra calculations (N, S, E, W bounds). (Ten degrees is a lot of land.)

      You might also want to consider storing the points in a database so that a SQL Query on a couple of indexes can do most of the data elimination for you in one pass.

      Also, if you multiply the Lat/Longs by 10,000, you can store and manipulate them as longs instead of singles. (This alone might be enough to solve your performance problem.)

      Just a quick stab at some SQL for this: (It assumes you've already calculated the bounding box.)
      SELECT zip FROM zip_lat_long WHERE lat > $south_border AND lat < $north_border AND long > $east_border AND long < $west_border AND $max_distance ** 2 > ( ABS( lat - $center_lat ) ** 2 + ABS( long - $center_long ) ** 2 );
      (With luck an optimizer will index seek the first four conditions before crunching the last one.)

      This is a neat problem, I wish I had time to bang out some sample code.
      Spring: Forces, Coiled Again!
Re: Zipcode Proximity script
by dga (Hermit) on Mar 31, 2003 at 16:38 UTC

    I would pursue data reduction.

    One possibiliy of data reduction is that at 60 degrees north any place within 1 degree longitude of another place is less than about 30 miles away and latitude is about 60 miles per degree so a real fast first pass might just be to take for your example a subset of the places where lat > 42.5 && lat < 43.5 and lon > 70 && lon < 72. These would of course be calculated from each desired source place and give a box of about 30 x 40 miles around your source. Then do the real distanace calculation to get the actual values which will elimate a few zips from the small set, all of which can probably be in local storage in the script for speed.

    Another, very accurate, way to set the bounding box would be to go 25 miles straight north for the start zip and then calculate the longitude difference needed to go 25 miles then set the box at + and - the lat and + and - the lon you calculated from the center and you have an about 25x25 mile box then use the real formula to eliminate the corner zips. If you work in lat or lon order in from the corners toward the center, you can determine as you go that all closer lat/lon (depending on your order) are inside the circle and do not need to calculate them. This way requires the real distance calculation to be done once to make up the bounding box, then simple subtractions for all other calculations in pass 1, then run the real dist calc in pass 2 for the corner cases until you are close enough in not to need them and the rest get included for zero added cost.

    Have fun

Re: Zipcode Proximity script
by tall_man (Parson) on Mar 31, 2003 at 16:18 UTC
    If your C code is slow, a perl implementation will be even slower. What you need is a better data structure so you don't have to do a linear search over all codes, for example a quad tree. The module Tree::M might help.

    Update: I notice that your perl code uses a flat-geometric distance rather than a curved-earth great circle distance. That's probably good enough for typical geometric queries (like "50 miles from my house"). But you could do the same thing in the C program. You could also search using the distance squared and avoid all those square roots.

    my $xdelta = $x1-$x2; my $ydelta = $y1-$y2; my $distance_squared = $xdelta*$xdelta + $ydelta*$ydelta;
      It gets better than that. The reason I need them in CSV format in two columns of origin,destination like that, is because I have to import them into... wait for it... a Microsoft Access .mdb file, so an ASP page can query it. The import only happens rarely, but often enough to want to optimize the tool that is doing the conversion.

      I'm slowly converting the system (carbon-based and silicon-based) to perl/mysql/php, depending on the needs, but for now, it has to remain in ADO/.mdb format during the transition.

      I'll take a peek at Tree::M though and see what that can do for me. Thanks.

•Re: Zipcode Proximity script
by merlyn (Sage) on Mar 31, 2003 at 19:53 UTC
      all at C speeds
      As in the language, or as in light? :)

      Makeshifts last the longest.

Re: Zipcode Proximity script
by CountZero (Bishop) on Mar 31, 2003 at 19:50 UTC

    A simple trick to cut time in half is to check only for places which are further down the file, as the distance from A to B is the same as the distance from B to A.

    After you have found all A to B distances, just invert the list to get all B to A distances as well.


    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Zipcode Proximity script
by Anonymous Monk on Mar 31, 2003 at 19:58 UTC
    First of all, what everyone else has said about first trimming your data set applies: for a given longitude and latitude, you should be able to come up with boundaries that are definitely in your set and boundaries that are definitely out. That being said, you can speed up your comparison even in the C code by doing this: I assume that in your main loop you do something like:
    while (read_from_file(&newzip, &newlat, &newlon)) { mydist = great_circle_distance(newlat,newlon,origlat,origlon); if (mydist < 25) { // do stuff to add it to file } if (mydist < 50) { ... } ... }
    Instead, replace it with this:
    float dist1cosine = cos(25.0 / EARTH_RADIUS); float dist2cosine = cos(50.0 / EARTH_RADIUS); float dist3cosine = cos(75.0 / EARTH_RADIUS); while (read_from_file(&newzip, &newlat, &newlon)) { mydistcos = great_circle_distance_cosine(newlat,newlon, origlat,origlon); if (mydistcos >= dist1cosine) { // do stuff to add it to file } if (mydistcos >= dist2cosine) { ... } ... }
    where you've got:
    static inline float great_circle_distance_cosine(float lat1, float long1, float lat2, float long2) { float delta_long, temp1, temp2, delta_lat; delta_lat = lat2 - lat1; delta_long = long2 - long1; temp1 = sin(lat1) * sin(lat2); temp2 = cos(delta_lat); temp2 -= temp1; temp2 *= cos(delta_long); /* result is cos(lat1) cos(lat2) [ cos(lon1 - lon2) ] + sin(lat1) sin(lat2) */ return (temp1 + temp2); }
    The point of this exercise is to avoid the atan2 and sqrt calls, which are done to take the arccos of sqrt(temp) in the original code. Over the range you're dealing with, arccos is strictly decreasing, (hence the switch from "<" to ">=" in the comparison) which means you don't need to take arccos every time in order to Actually, I think that you could translate this algorithm into perl and, even without applying any pre-distance filtering, get better results than your current program.

    Edit by tye, remove PRE tags surrounding CODE tags

      Here's a way to do all that prefiltering people keep mentioning: First, some helper procedures:
      use constant EARTH_RADIUS => 3956; use constant PI => 4 * atan2 1, 1; sub make_quick_reject { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($distcos) = cos($dist / EARTH_RADIUS); my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos); if (abs($lat) + $maxlatdiff > PI/2) { # special case - near poles return sub { abs($lat - $_[0]) > $maxlatdiff; }; } my($longmult); $longmult = 1 / cos(abs($lat) + $maxlatdiff); my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) > $maxlatdiff or abs($long - $_[1]) > + $maxlongdiff; } } sub make_quick_accept { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($distcos) = cos($dist / EARTH_RADIUS); my($maxlatdiff) = atan2(sqrt(1 - ($distcos * $distcos)), $distcos) / + sqrt(2); my($longmult); if (abs($lat) < $maxlatdiff ) { # special case - near equator $longmult = 1; } else { $longmult = 1 / cos(abs($lat) - $maxlatdiff); } my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) < $maxlatdiff and abs($long - $_[1]) < $maxlongdiff; } }
      Then, I assume that you've loaded your data into something like this:
      # zipcode, lat, long @data = (['00210', 43.00589, 71.0132], ['00211', 43.00589, 71.0132], . +..);
      And then done something like this to it:
      map { $_->[1] *= PI/180; $_->[2] *= PI / 180;} @data;
      Then you can do something like this:
      $distance = 25; for $i (0..$#data) { my $quick_reject_routine = make_quick_reject($data[$i]->[1], $data[$i]->[2], $distance); my $quick_accept_routine = make_quick_accept($data[$i]->[1], $data[$i]->[2], $distance); for $record (grep(! $quick_reject_routine->($_->[1],$_->[2]), @data[$i+1..$#data])) { if ($quick_accept_routine->($record->[1],$record->[2]) or long_acceptance_routine($record, $data[$i], $distance) { # write it to some file or other } } }
      Where for "long_calculation_routine", you substitute in essentially what you're doing now. (or better yet, something analogous to the simpler C code suggested in the post I made before logging in) Extending this to the case of checking several distances at once shouldn't be hard - just construct an array of quick reject and quick accept procedures, and then check through them in order before trying the long acceptance routine. Something like this:
      @distances = qw(5 10 25 50 100); @distfiles = map { "0-$_.txt" } @distances; for $i (0..$#data) { my @quick_reject_routines = map { make_quick_reject($data[$i]->[1], $data[$i]->[2], $_); } @distances; my @quick_accept_routines = map { make_quick_accept($data[$i]->[1], $data[$i]->[2], $_); } @distances; OTHERCITY: for $j ($i..$#data)) { DISTANCE: for $d (0..$#distances) { if ($quick_reject_routines[$d]->($data[$j][1],$data[$j][2])) { next DISTANCE; } if ($quick_accept_routines[$d]->($data[$j][1],$data[$j][2]) or long_acceptance_routine($data[$j], $data[$i], $distance) { map { write_csv_line($data[$i], $data[$j], $_); write_csv_line($data[$j], $data[$i], $_); } @distfiles[$d..$#distances]; next OTHERCITY; } } } }
      Of course, this still contains extraneous calls to long_acceptance_routine, but my hope is that the routines long_acceptance_routine depends on will somehow be caching recent results (see the Memoize modules) and so there won't be the performance penalty.

        Whoops. This is what I get for coding after my brain has already shut down for nightly maintenance.

        Those quick_accept and quick_reject routines will give false rejections with longitudes near the 180 degree longitude. (not an issue with US locations, I know, but...) Also, there's a stupid bit of time-wasting going on near the top there.

        Let's see if I can get it right this time:

        sub make_quick_reject { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($maxlatdiff) = $dist / EARTH_RADIUS; if (abs($lat) + $maxlatdiff > PI/2) { # special case - near poles return sub { abs($lat - $_[0]) > $maxlatdiff; }; } my($longmult); $longmult = 1 / cos(abs($lat) + $maxlatdiff); my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) > $maxlatdiff or ( abs($long - $_[1]) > $maxlongdiff and abs($long - $_[1]) < 2*PI - $maxlongdiff ); }; } sub make_quick_accept { # Args: lat (radians), long (radians), dist my($lat,$long,$dist) = @_; my($maxlatdiff) = $dist / EARTH_RADIUS; $maxlatdiff /= sqrt(2); my($longmult); if (abs($lat) < $maxlatdiff ) { # special case - near equator $longmult = 1; } else { $longmult = 1 / cos(abs($lat) - $maxlatdiff); } my($maxlongdiff) = $maxlatdiff * $longmult; return sub { abs($lat - $_[0]) < $maxlatdiff and ( abs($long - $_[1]) < $maxlongdiff or abs($long - $_[1]) > 2*PI - $maxlongdiff ); }; }

        And just again as a reminder, the sub's produced by these two routines accept longitudes and latitudes in RADIANS, not degrees. (Though reworking them to use degrees instead of radians should not be difficult at all)

        And if anyone uses this code without understanding how it works, they deserve what they get. No warranty of anything useful implied at all; the code may not even pass perl -c.

        instead of my $maxlatdiff = atan2(sqrt(1 - ($distcos * $distcos)), $distcos); you could do
        use Math::Trig; my $maxlatdiff = acos($distcos);
Re: Zipcode Proximity script
by shotgunefx (Parson) on Mar 31, 2003 at 22:22 UTC
    I had a similar problem and did a quick and dirty reduction by having a map for each state that contained all bordering states which is a quick and easy way to reduce the data.


    "To be civilized is to deny one's nature."
      Won't work. Check out how close 44030 (Conneaut, OH) is to 14775 (Ripley, NY). But there's a whole state in the way! (PA, if you're curious.)
      Spring: Forces, Coiled Again!
        It's simple, add the non-bordering states that have borders with N miles to that's states border-states.


        "To be civilized is to deny one's nature."
Re: Zipcode Proximity script
by Anonymous Monk on Mar 31, 2003 at 21:14 UTC
    We had a job like this come through as a proposal a couple months ago. Guy had a salesman in zip code A, B, and C. Had a customer in zip code D,E, and F. Wanted to know which salesmen was closest.

    The results would be stored on his server. He was basically expecting a hash of each Zip Code centroid in the US measured to every other zip code centroid. Hate to bust the guy's bubble but it would've taked years to do all the calculations, assuming really fast calculations.

    So.... Maybe you could eliminate all the zips that aren't + or - 1 utm zone from the one you're measuring from. That might eliminate quite a few. Maybe speed it up a bit. Neil
Re: Zipcode Proximity script
by Anonymous Monk on Nov 13, 2007 at 01:14 UTC
    check this out. This solution contains a Stored Procedure that accepts a zip code and mile radius and returns all zip codes within that radius. (Example: Give me all the zip codes within a 10-mile radius of 50325). The Solution comes with and a dataload script that will create a SQL Server database, the Stored Procedure containing the algorithm, and load over 42,00 zip codes.