Re^3: fastest way to compare numerical strings?

Replies are listed 'Best First'.
Re^4: fastest way to compare numerical strings? (300_000 points in under 4 minutes!) by BrowserUk (Patriarch) on Jul 02, 2008 at 03:46 UTC
The code below finds all the groupings of points within 0.01 degrees of each other from 300_000 pairs of coordinates in under 4 minutes (3:42) on my less-than-stellar 2.66GHz single processor box. How it works Having generated the 300_000 pairs within the OPs prescribed bounds, it then normalises and scales the points (1000) to fit them onto a suitably sized image (~4000x4000) for plotting. It then generates an all white, 24-bit color image. It then iterates the normalised pairs and plots a cirle (20 pixels in diameter (0.01 2 * 1000), at the location given by the first scaled coordinate pair, in color 0 (corresponding to the pairs location in the arrays). When it goes to plot the second point, it first checks the color of the pixel at the scaled coordinates. If this is still white, then this point does not come within 0.01 of any point plotted so far, and it is plotted in a color that corresponds to its position in the array. However, if the pixel is some color other than white, this pair is within 0.01 of some point already plotted. In this case, the new circle is plotted in that color, forming a group of that color. It is also added to an array of points (keyed by that color), which effectively groups the cluster of points. This process continues for all the points. If the color at the centre of its location is not white, then its circle is plotted in the color found there and it is added to the HoAs keyed by that color. If that pixel is white, it is plotted in a new, unique color to await being joined (or not) by a subsequent point. In a single pass, all clusters are detected and can then be printed out or further processed. The code below also outputs two images. The first contains circles drawn for every point. The second contains just the points that became grouped with at least one other point. With randomly generated data, there is not much difference between the images, but they can be seen if you blow the images up a bit and flip-flop between them. It produces a file whereby the grouped indexes have been converted back to their corresponding coordinated and grouped. It looks like this (a small section): [ 10.7685250292969 24.0294289818115 ] [ 10.764576366333 24.0289633201904 ] ------ [ 11.8467422927246 22.0280153343506 ] [ 11.8505748185425 22.0356987510986 ] ------ [ 10.3691293842163 22.0120664238281 ] [ 10.3718005385742 22.0104366081543 ] ------ [ 10.9817528293457 22.6283695793457 ] [ 10.974320052002 22.6306978874512 ] ------ [ 12.2627455496826 23.6263988487549 ] [ 12.2549643609009 23.62430337146 ] ------ [ 13.0046296383057 23.9849582969971 ] [ 13.0082298898315 23.9801852653809 ] ------ [ 12.3460158833618 23.0528201469727 ] [ 12.3482224891357 23.0609692253418 ] ------ [ 13.6142335176392 24.0877530998535 ] [ 13.619808100647 24.0818159141846 ] ------ [ 10.4141905968628 21.2074031425781 ] [ 10.4150035568848 21.2060061577148 ] ------ [ 11.0738495861206 23.2290730705566 ] [ 11.0661845344849 23.2240672081299 ] [ 11.0827921463623 23.2314013786621 ] ------ [ 10.8340263796387 21.6198629234619 ] [ 10.8307745395508 21.6281284172363 ] [ 10.8416914312744 21.6245195396729 ] ------ [download] Further processing to do the date filtering is left as an exercise :) The test code Read more... (3 kB) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^5: fastest way to compare numerical strings? (300_000 points in under 4 minutes!) by salva (Canon) on Jul 04, 2008 at 08:01 UTC
your algorithm doesn't take into account that a new point can connect two previously unconnected clusters. For instance, with $scale = 1.0, the points (1.0, 0) (2.5, 0) (1.8, 0) will not be classified in the same cluster.	[reply]
Re^6: fastest way to compare numerical strings? (Problem acknowledged) by BrowserUk (Patriarch) on Jul 04, 2008 at 08:33 UTC
It took a few seconds to understand this. But yes, it is a problem. Thanks for bringing it to my attention. Whether the problem occurs depends on what order the points get plotted. Now the question is, is there a way of sorting the input to ensure that closer points get plotted before those further away (which would prevent it from occuring)? And what effect would sorting the input have upon the performance? An alternative fix would be to use some kind of AndOr or Xor mode when plotting; or perhaps something with the alpha channel, but I haven't thought that through yet. I'll need to think about it for a while. Have any potential fixes crossed your mind? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^7: fastest way to compare numerical strings? (Problem acknowledged) by salva (Canon) on Jul 04, 2008 at 09:55 UTC
Re^8: fastest way to compare numerical strings? (Problem fixed)) by BrowserUk (Patriarch) on Jul 05, 2008 at 07:01 UTC
Re^6: fastest way to compare numerical strings? (Potential solution) by BrowserUk (Patriarch) on Jul 04, 2008 at 09:49 UTC
I think I've come up with a potential fix, though I need to sleep before I will have the tuits to code it. The mechanism I envisage is to plot the circles larger than scale. This will cause points that are further away than the specified limit to be grouped artificially. You then replot just the points in each group, at the correct scaling, on a clean canvas using alpha-blending, to eliminate the false hits. The first pass will reduce the problem to a few small, isolated groups making the second pass only need to deal with a few points at a time, so the slowdown shouldn't be significant. I'll try to post a corrected solution later. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]

How it works