whakka has asked for the wisdom of the Perl Monks concerning the following question:

Using Geo::ShapeFile, basically what the title says - I have a long list of coordinates (~300K) that need to be matched to Census blocks (coverage is the entire US). The block information resides in a number of shapefiles - one for each county in each state, where each internal shape is a block.

Here's a naive nested loop implementation:

for my $shp ( @shapefiles ) { # Open the shapefile for processing my $shapefile = new Geo::ShapeFile($shp); # Point object for (x,y) address coordinate my $point = new Geo::ShapeFile::Point(X => $address->{'long'}, Y = +> $address->{'lat'}); if ( $shapefile->bounds_contains_point($point) ) { # Get data if the point falls inside this file for my $n ( 1..$shapefile->shapes() ) { my $shape = $shapefile->get_shp_record($n); if ( $shape->contains_point($point) ) { my %dbf = $shapefile->get_dbf_record($n); return \%dbf; } } } } # next shapefile

The bounds_contains_point method simply checks if the shapefile's max and min x,y values form a rectangle that contains the given point, so the same point could iterate through multiple shapefiles before finding its corresponding block ($shape->contains_point($point) actually does the leg work to see if the point falls inside).

I could potentially create a file with the centroid of each block(shape) mapping to its corresponding shapefile and the index of the shape within. Then I iterate through the centroids with minimum distance to the given point until I find the block(shape) that contains the given point. The centroids would have to be ordered or partitioned somehow so I wouldn't compute distances for each <point,centroid> combination. How you ask? That's a good question.

Of course it's more than likely this has been optimized before so if anyone knows about this I'd be grateful. Please don't point me to software like ArcGIS or Open Jump since I have access to these tools already and they don't offer an efficient solution. Otherwise, do comment!

Replies are listed 'Best First'.
Re: Need an intelligent join algorithm for matching coordinates to shapefiles (.shp)
by Joost (Canon) on Dec 17, 2008 at 19:59 UTC
Re: Need an intelligent join algorithm for matching coordinates to shapefiles (.shp)
by BrowserUk (Patriarch) on Dec 17, 2008 at 19:47 UTC
Re: Need an intelligent join algorithm for matching coordinates to shapefiles (.shp)
by kennethk (Abbot) on Dec 17, 2008 at 19:55 UTC

    Not being familiar with Census blocks, would it be fair to say they are a rectangular? Are they space filling and non-overlapping? Are they all the same size or of varying size? The regularity of the problem has direct impacts on how clever you can be with grid organization.

    At the least, I note that Geo::ShapeFile->new does file access, and as such it's likely a good idea to preload all shape objects rather that loading them at compare time - how much diskspace is devoted to your shape files?

      Census blocks are non-overlapping and are rectangular in shape and regular in size only in urban areas, otherwise (most of the US) they're irregular and of varying size. By "space-filling" I assume you mean points are contained within unique blocks, which is true.

      Pre-loading them is unfortunately not an option as collectively they're about 6GB although I'm sure there are smart ways of doing so, thanks.

        Space filling means that there is no portion of the map which does not have a block associated with it (no gaps). In Geo/ShapeFile, it says it only loads data as it needs it, so it's possible that only some fraction of the 6GB need be loaded for your problem - might be worth running a test script to check.

        Knowing how G-men think, the QuadTree approach Joost suggests is definitely worth trying - IIRC you essentially build a tree of hierarchical rectangles. You can then use this minimal information (which should certainly be less than 6GB) to compare instead of reloading point data in your loop.

        I think you've got that backwards. They should be rectangular in rural areas, like the underlying townships. I know they're certainly irregular in this urban area.

        UPDATE: Doh, sorry, I was thinking of tracts. Blocks are broken by natural boundaries, so you'd have to be in the desert to be square. OTOH, none of the divisions of census data in Boston are rectilinear.

        --
        In Bob We Trust, All Others Bring Data.