Let's not have me share the data just yet.
A representative sample (even if made up), or just the bounding box of the dataset and the precision of the points, would allow us to generate random data that roughly matches the problem space to test possibilities.
Also, a brief description of the purpose of the code would help. For example, are you counting the points that lie within each cell of a 0.01 x 0.01 grid? Or are you trying to "cluster" the points regardless of any predefined gridding?
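For what it's worth, the grid interpretation would amount to simple binning, along these lines (a Python sketch for illustration only; it is not necessarily what the OP's code does):

```python
from collections import Counter

def grid_counts(points, cell=0.01):
    """Count points per cell of a `cell` x `cell` grid -- the first
    interpretation asked about above.  Each (lat, lon) pair is mapped
    to the integer coordinates of the cell containing it."""
    return Counter((int(lat // cell), int(lon // cell))
                   for lat, lon in points)
```

The "clustering regardless of gridding" interpretation is harder, because two points can be within 0.01 of each other while sitting in adjacent cells.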
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Okay, I am happy with speed now. Winning answers were a combination of:
- the system 'sort' step mentioned by jethro and apl and maybe others (and searching only 1000 adjacent lines).
- the 'abs' suggestion from grep and maybe others.
The rest of the stuff sounds fascinating and potentially useful. Thanks, truthfully, to everyone. Soon I hope to do this on a global daily dataset where n = 800 million or so. If anyone wants to hammer me on better methods and such, please feel free. Again, thanks to all.
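For anyone landing here later, the winning sort-then-scan idea can be sketched like this (in Python for illustration; the actual solution was Perl plus the system sort, and "within 0.01" is taken here to mean within 0.01 in both coordinates):

```python
def close_pairs_sorted(points, eps=0.01):
    """Find all pairs of (lat, lon) points within eps of each other in
    both coordinates.  Sorting by latitude first means each point only
    needs to be compared against the few following points whose
    latitude is within eps -- the 'sort, then scan nearby lines' idea,
    instead of all n^2 comparisons."""
    pts = sorted(points)                   # sorts by lat, then lon
    pairs = []
    for i, (lat1, lon1) in enumerate(pts):
        for lat2, lon2 in pts[i + 1:]:
            if lat2 - lat1 > eps:          # sorted: no later point is close
                break
            if abs(lon2 - lon1) <= eps:
                pairs.append(((lat1, lon1), (lat2, lon2)))
    return pairs
```

The early `break` is what buys the speed: after sorting, the inner loop touches only a narrow latitude band around each point rather than the whole dataset.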
Okey doke, while the tests run I'll type. Basically, the code is going through points representing fire detections by a NASA satellite (MODIS). Those points that fall close together (within 0.01 degrees) AND occur on consecutive days and at appropriate times ($numbdate and $timestamp) are written to a text file for review in other software (ArcView GIS) as part of ongoing research.
Latitude stats:
Minimum: 10.318842
Maximum: 14.124424
Mean: 11.24719
Standard Deviation: 0.805771
Longitude stats:
Minimum: 21.097507
Maximum: 24.912207
Mean: 22.474358
Standard Deviation: 0.974835
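Given that bounding box, representative random test data can be generated along these lines (a Python sketch; the bounds are taken from the stats just listed, and real detections are presumably not uniformly distributed):

```python
import random

def sample_points(n,
                  lat_range=(10.318842, 14.124424),
                  lon_range=(21.097507, 24.912207)):
    """Generate n (lat, lon) pairs uniformly inside the bounding box
    reported above -- enough to exercise the code without the real
    MODIS data."""
    return [(random.uniform(*lat_range), random.uniform(*lon_range))
            for _ in range(n)]
```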
Thanks!
The code below finds all the groupings of points within 0.01 degrees of each other from 300_000 pairs of coordinates in under 4 minutes (3:42) on my less-than-stellar 2.66 GHz single-processor box.
How it works
- Having generated the 300_000 pairs within the OP's prescribed bounds, it then normalises and scales the points (*1000) to fit them onto a suitably sized image (~4000x4000) for plotting.
- It then generates an all white, 24-bit color image.
- It then iterates the normalised pairs and plots a circle, 20 pixels in diameter (0.01 * 2 * 1000), at the location given by the first scaled coordinate pair, in color 0 (corresponding to the pair's position in the arrays).
When it goes to plot the second point, it first checks the color of the pixel at the scaled coordinates. If this is still white, then this point does not come within 0.01 of any point plotted so far, and it is plotted in a color that corresponds to its position in the array.
However, if the pixel is some color other than white, this pair is within 0.01 of some point already plotted. In this case, the new circle is plotted in that color, forming a group of that color. It is also added to an array of points (keyed by that color), which effectively groups the cluster of points.
This process continues for all the points: if the color at the centre of a point's location is not white, its circle is plotted in the color found there and the point is added to the HoA keyed by that color; if that pixel is white, it is plotted in a new, unique color to await being joined (or not) by a subsequent point.
In a single pass, all clusters are detected and can then be printed out or further processed.
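The single-pass trick can be sketched without an actual image library (a Python sketch under stated assumptions: a 2-D array of ints stands in for the 24-bit image, and it inherits the same caveat as the image approach, namely that a disc painted later can overwrite an earlier color):

```python
def cluster_by_canvas(points, scale=1000, radius=10):
    """Cluster (lat, lon) points by painting discs on a canvas.
    0 plays the role of 'white'; any other value is a cluster color.
    Each point paints a disc of `radius` pixels (radius = 0.01 * scale);
    a point whose own centre pixel is already colored joins that
    cluster, exactly as described above."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    lat0, lon0 = min(lats), min(lons)
    # Canvas big enough for the scaled points plus disc padding.
    w = int((max(lats) - lat0) * scale) + 2 * radius + 1
    h = int((max(lons) - lon0) * scale) + 2 * radius + 1
    canvas = [[0] * h for _ in range(w)]
    clusters = {}            # color -> list of point indices (the 'HoA')
    next_color = 0
    for idx, (lat, lon) in enumerate(points):
        x = int((lat - lat0) * scale) + radius
        y = int((lon - lon0) * scale) + radius
        color = canvas[x][y]
        if color == 0:                    # still white: start a new group
            next_color += 1
            color = next_color
        clusters.setdefault(color, []).append(idx)
        for dx in range(-radius, radius + 1):     # paint the disc
            for dy in range(-radius, radius + 1):
                if dx * dx + dy * dy <= radius * radius:
                    canvas[x + dx][y + dy] = color
    # Keep only the groups that attracted more than one point.
    return [grp for grp in clusters.values() if len(grp) > 1]
```

The Perl original does the same thing with a real image via GD, which also gives you the pictures for free.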
The code below also outputs two images. The first contains circles drawn for every point. The second contains just the points that became grouped with at least one other point. With randomly generated data there is not much difference between the images, but the differences can be seen if you blow the images up a bit and flip-flop between them.
It produces a file in which the grouped indexes have been converted back to their corresponding coordinates and grouped. It looks like this (a small section):
[ 10.7685250292969 24.0294289818115 ]
[ 10.764576366333 24.0289633201904 ]
------
[ 11.8467422927246 22.0280153343506 ]
[ 11.8505748185425 22.0356987510986 ]
------
[ 10.3691293842163 22.0120664238281 ]
[ 10.3718005385742 22.0104366081543 ]
------
[ 10.9817528293457 22.6283695793457 ]
[ 10.974320052002 22.6306978874512 ]
------
[ 12.2627455496826 23.6263988487549 ]
[ 12.2549643609009 23.62430337146 ]
------
[ 13.0046296383057 23.9849582969971 ]
[ 13.0082298898315 23.9801852653809 ]
------
[ 12.3460158833618 23.0528201469727 ]
[ 12.3482224891357 23.0609692253418 ]
------
[ 13.6142335176392 24.0877530998535 ]
[ 13.619808100647 24.0818159141846 ]
------
[ 10.4141905968628 21.2074031425781 ]
[ 10.4150035568848 21.2060061577148 ]
------
[ 11.0738495861206 23.2290730705566 ]
[ 11.0661845344849 23.2240672081299 ]
[ 11.0827921463623 23.2314013786621 ]
------
[ 10.8340263796387 21.6198629234619 ]
[ 10.8307745395508 21.6281284172363 ]
[ 10.8416914312744 21.6245195396729 ]
------
Further processing to do the date filtering is left as an exercise :)
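For what it's worth, that exercise could be layered on top of the groups like this (a Python sketch; `day_of` is a hypothetical lookup from point index to day number, standing in for the $numbdate field in the real records):

```python
def filter_consecutive_days(groups, day_of):
    """Keep only the groups that contain detections on consecutive
    days.  `groups` is a list of lists of point indices (as produced
    by the clustering step); `day_of` maps a point index to its day
    number."""
    kept = []
    for group in groups:
        days = sorted({day_of[i] for i in group})
        # A group survives if any two of its days are exactly 1 apart.
        if any(b - a == 1 for a, b in zip(days, days[1:])):
            kept.append(group)
    return kept
```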
The test code