comment on

The code below finds all the groupings of points within 0.01 degrees of each other from 300_000 pairs of coordinates in under 4 minutes (3:42) on my less-than-stellar 2.66GHz single processor box.

How it works

Having generated the 300_000 pairs within the OPs prescribed bounds, it then normalises and scales the points (*1000) to fit them onto a suitably sized image (~4000x4000) for plotting.
It then generates an all white, 24-bit color image.
It then iterates the normalised pairs and plots a cirle (20 pixels in diameter (0.01 * 2 * 1000), at the location given by the first scaled coordinate pair, in color 0 (corresponding to the pairs location in the arrays).
When it goes to plot the second point, it first checks the color of the pixel at the scaled coordinates. If this is still white, then this point does not come within 0.01 of any point plotted so far, and it is plotted in a color that corresponds to its position in the array.
However, if the pixel is some color other than white, this pair is within 0.01 of some point already plotted. In this case, the new circle is plotted in that color, forming a group of that color. It is also added to an array of points (keyed by that color), which effectively groups the cluster of points.
This process continues for all the points. If the color at the centre of its location is not white, then its circle is plotted in the color found there and it is added to the HoAs keyed by that color. If that pixel is white, it is plotted in a new, unique color to await being joined (or not) by a subsequent point.
In a single pass, all clusters are detected and can then be printed out or further processed.

The code below also outputs two images. The first contains circles drawn for every point. The second contains just the points that became grouped with at least one other point. With randomly generated data, there is not much difference between the images, but they can be seen if you blow the images up a bit and flip-flop between them.

It produces a file whereby the grouped indexes have been converted back to their corresponding coordinated and grouped. It looks like this (a small section):

[ 10.7685250292969 24.0294289818115 ]
[ 10.764576366333 24.0289633201904 ]
------
[ 11.8467422927246 22.0280153343506 ]
[ 11.8505748185425 22.0356987510986 ]
------
[ 10.3691293842163 22.0120664238281 ]
[ 10.3718005385742 22.0104366081543 ]
------
[ 10.9817528293457 22.6283695793457 ]
[ 10.974320052002 22.6306978874512 ]
------
[ 12.2627455496826 23.6263988487549 ]
[ 12.2549643609009 23.62430337146 ]
------
[ 13.0046296383057 23.9849582969971 ]
[ 13.0082298898315 23.9801852653809 ]
------
[ 12.3460158833618 23.0528201469727 ]
[ 12.3482224891357 23.0609692253418 ]
------
[ 13.6142335176392 24.0877530998535 ]
[ 13.619808100647 24.0818159141846 ]
------
[ 10.4141905968628 21.2074031425781 ]
[ 10.4150035568848 21.2060061577148 ]
------
[ 11.0738495861206 23.2290730705566 ]
[ 11.0661845344849 23.2240672081299 ]
[ 11.0827921463623 23.2314013786621 ]
------
[ 10.8340263796387 21.6198629234619 ]
[ 10.8307745395508 21.6281284172363 ]
[ 10.8416914312744 21.6245195396729 ]
------
[download]

Further processing to do the date filtering is left as an exercise :)

The test code

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use List::Util qw[ reduce ];
use GD;

use constant {
    WHITE => 16777215,
    MinX => 10.318842, MaxX => 14.124424,
    MinY => 21.097507, MaxY => 24.912207,
};
use constant {
    SpanX => MaxX - MinX, SpanY => MaxY - MinY,
};

sub rgb2n{ unpack 'N', pack 'CCCC', 0, @_ }
sub n2rgb{ unpack 'xCCC', pack 'N', $_[0] }

our $POINTS ||= 300_000;

my @points = map [ MinX + rand( SpanX ), MinY + rand( SpanY ) ], 1 .. 
+$POINTS;

## Normalise to our grid, scale by 1000 and integerise
my @scaled = map [
    int( ( $_->[ 0 ] - MinX ) * 1000 ),
    int( ( $_->[ 1 ] - MinY ) * 1000 ),
], @points;

my( $sizeX, $sizeY ) = ( int( SpanX * 1000 ), int( SpanY * 1000 ) );
my $img = GD::Image->new( $sizeX, $sizeY , 1 );
$img->filledRectangle( 0, 0, $sizeX, $sizeY, rgb2n( 255, 255, 255 ) );


my %groups;
for my $i ( 0 .. $#scaled ) {
    my $color = $img->getPixel( @{ $scaled[ $i ] } );
    if( $color != WHITE ) {
        push @{ $groups{ $color } }, $i;
    }
    else {
        $color = $i;
    }
    $img->filledEllipse( @{ $scaled[ $i ] }, 20, 20, $i );
}

## Save the image showing all points/circles
open IMG, '>:raw', '694790all.png' or die $!;
print IMG $img->png;
close IMG;
system 1, '694790all.png';

undef $img;

## Produce a file of clusters of points
open OUT, '>', '694790.groups' or die $!;
for my $key ( sort { $a <=> $b } keys %groups ) {
    print OUT "[ @{ $points[ $_ ] } ]" for $key, @{ $groups{ $key } };
    print OUT "------";
}
close OUT;

## Produce second image showing just the clustered points.
$img = GD::Image->new( $sizeX, $sizeY , 1 );
$img->filledRectangle( 0, 0, $sizeX, $sizeY, rgb2n( 255, 255, 255 ) );

for my $key ( sort { $a <=> $b } keys %groups ) {
    $img->filledEllipse( @{ $scaled[ $key ] }, 20, 20, $key );
    $img->filledEllipse( @{ $scaled[ $_ ] }, 20, 20, $key ) for @{ $gr
+oups{ $key } };
}

open IMG, '>:raw', '694790grouped.png' or die $!;
print IMG $img->png;
close IMG;
system 1, '694790grouped.png';


__END__
Latitude stats:
Minimum: 10.318842
Maximum: 14.124424
Mean: 11.24719
Standard Deviation: 0.805771


Longitude stats:
Minimum: 21.097507
Maximum: 24.912207
Mean: 22.474358
Standard Deviation: 0.974835
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^4: fastest way to compare numerical strings? (300_000 points in under 4 minutes!) by BrowserUk
in thread fastest way to compare numerical strings? by Anonymous Monk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.