jasonk has asked for the wisdom of the Perl Monks concerning the following question:

I've been fighting this chunk of code for days, and I think there probably exists a much simpler solution, but I can't see the forest for the trees. :)

I have some data that looks like this:

my @data = (
    [
        bless({'X' => '2289290.30567314', 'Y' => '507448.506452598'}, 'Point'),
        bless({'X' => '2289282.72943502', 'Y' => '507410.155776398'}, 'Point'),
        bless({'X' => '2289149', 'Y' => '507002.0625'}, 'Point'),
        bless({'X' => '2289055.31527948', 'Y' => '506731.474946477'}, 'Point'),
    ],
    [
        bless({'X' => '2289563.40820205', 'Y' => '512818.142015762'}, 'Point'),
        bless({'X' => '2289542.5', 'Y' => '512756.0625'}, 'Point'),
        bless({'X' => '2289188.03442014', 'Y' => '510403.174475172'}, 'Point'),
    ],
    [
        bless({'X' => '2287498.5', 'Y' => '502968.75'}, 'Point'),
        bless({'X' => '2287529.54227622', 'Y' => '502752.130034798'}, 'Point'),
        bless({'X' => '2287571.41650288', 'Y' => '502527.884072168'}, 'Point'),
    ],
    [
        bless({'X' => '2288410.26157533', 'Y' => '505497.135963316'}, 'Point'),
        bless({'X' => '2288373.62815415', 'Y' => '505449.841277223'}, 'Point'),
        bless({'X' => '2287631.5', 'Y' => '503533.9375'}, 'Point'),
        bless({'X' => '2287593', 'Y' => '503471'}, 'Point'),
    ],
    [
        bless({'X' => '2289188.03442014', 'Y' => '510403.174475172'}, 'Point'),
        bless({'X' => '2289106.00673453', 'Y' => '509893.307668147'}, 'Point'),
    ],
    [
        bless({'X' => '2288451.69937281', 'Y' => '505554.395479093'}, 'Point'),
        bless({'X' => '2288410.26157533', 'Y' => '505497.135963316'}, 'Point'),
    ],
    [
        bless({'X' => '2289106.00673453', 'Y' => '509893.307668147'}, 'Point'),
        bless({'X' => '2289299.27685392', 'Y' => '507531.897507685'}, 'Point'),
        bless({'X' => '2289290.30567314', 'Y' => '507448.506452598'}, 'Point'),
    ],
    [
        bless({'X' => '2287593', 'Y' => '503471'}, 'Point'),
        bless({'X' => '2287498.5', 'Y' => '502968.75'}, 'Point'),
    ],
    [
        bless({'X' => '2288891.13335288', 'Y' => '506310.118972175'}, 'Point'),
        bless({'X' => '2288451.69937281', 'Y' => '505554.395479093'}, 'Point'),
    ],
    [
        bless({'X' => '2289055.31527948', 'Y' => '506731.474946477'}, 'Point'),
        bless({'X' => '2289012.5', 'Y' => '506607.8125'}, 'Point'),
        bless({'X' => '2288907', 'Y' => '506360.3125'}, 'Point'),
        bless({'X' => '2288891.13335288', 'Y' => '506310.118972175'}, 'Point'),
    ],
);

Each of the array refs contains a series of points (latitude/longitude) that represents a chunk of a road. I am trying to find the refs that have beginning/ending points in common and combine them, basically turning all the chunks into one long stretch of road. This is the code I have now:

sub merge {
    my @allparts = @_;
    my @complete = @{ shift(@allparts) };

    ALLPARTS: while (@allparts) {
        for (0 .. $#allparts) {
            my @thispart = @{ $allparts[$_] };
            if ($thispart[0] == $complete[$#complete]) {
                shift(@thispart);   # remove part, so we don't double points
                @complete = (@complete, @thispart);
                splice(@allparts, $_, 1);
                next ALLPARTS;
            }
            if ($complete[0] == $thispart[$#thispart]) {
                shift(@complete);   # remove part
                @complete = (@thispart, @complete);
                splice(@allparts, $_, 1);
                next ALLPARTS;
            }
        }
        die "Found pieces that don't fit!\n" . Dumper(@allparts);
    }
    return @complete;
}

This works as long as the parts can all be combined to form one large road; if the road consists of more than one unconnected section, it doesn't work. (To demonstrate this, call splice(@data,7,1) before calling merge(), to remove one part of the road.) Ideally I'd like this function to condense the data down to as few parts as possible, but because this code is so ugly, everything I've attempted has rapidly become unreadable and given me a headache. So if anybody has any ideas, either on how to accomplish my end goal or on a simpler way to implement the current merge() sub that would make it easier for me to expand, I'd appreciate any input.
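For reference, this is roughly how the failure shows up, using the @data and merge() shown above:

my @whole = merge(@data);     # fine: every chunk joins into one road

splice(@data, 7, 1);          # drop one of the connecting chunks
my @broken = merge(@data);    # dies with "Found pieces that don't fit!"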

For testing purposes, here is a simplified implementation of the Point package, so the overloading used in the merge() function will work:

package Point;

use overload '==' => sub {
    return ($_[0]->{X} == $_[1]->{X} && $_[0]->{Y} == $_[1]->{Y});
};
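With that stub in place, two points with equal coordinates compare equal, which is all merge() relies on. For example (using coordinates taken from the data above):

my $p = bless { X => 2287593, Y => 503471 }, 'Point';
my $q = bless { X => 2287593, Y => 503471 }, 'Point';
print "same point\n" if $p == $q;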

Replies are listed 'Best First'.
Re: AoA data merging
by rjray (Chaplain) on Mar 10, 2003 at 23:50 UTC

    Updated to fix a few errors, and to replace "push" with "unshift" so that the next loop iteration works on the same running node as the previous iteration, until the node is shunted to @final.

    Woof, what a mess.

    The core problem here looks to me as though your algorithm is specifically designed to reduce the list of sub-graphs to a single, contiguous graph. And since it doesn't appear to be designed to expect multiple resultant graphs, it doesn't like them.

    There are two things I'm going to ignore for now: the possibility of floating-point error in the == overload, and the possibility that more than one distinct entry in the initial list shares a starting or ending point; that is, two lists from the set that have the same starting point or the same ending point, but are otherwise totally different.
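    (If the floating-point issue does bite, one common option is a tolerance-based comparison instead of exact equality. A minimal, untested sketch, with the epsilon value picked arbitrarily:)

    package Point;

    use overload '==' => sub {
        my ($p, $q) = @_;
        my $eps = 1e-6;    # tune to the precision of your coordinate data
        return abs($p->{X} - $q->{X}) < $eps
            && abs($p->{Y} - $q->{Y}) < $eps;
    };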

    Disclaimer: this is totally pulled out of my ass, and has not been tested. If it doesn't work, fiddle with it a bit and I think you can get it to.

    @pointset = (...);
    @final    = ();

    # The basic premise is to pull the first point off of
    # the queue and try to attach any of the remaining
    # points in the queue to it. If one of them can be
    # attached, remove it as well and attach correctly.
    # Then, push the now-longer element onto the end of
    # the queue.
    #
    # If you get through the queue without attaching any
    # new elements to the top or bottom, then push it
    # onto @final, instead. Eventually, @pointset will be
    # empty.
    while (defined($node = shift(@pointset))) {
        for $index (0 .. $#pointset) {
            if ($node->[0] == $pointset[$index]->[$#{$pointset[$index]}]) {
                # Add to front of $node
                shift(@$node);    # Avoid point at $node->[0] dup'd
                @$node = (@{$pointset[$index]}, @$node);
                # This splice could be used directly above, but I
                # am aiming for clarity in this example.
                splice(@pointset, $index, 1);
                unshift(@pointset, $node);
                next;    # Go back to the top of the while-loop
            }
            elsif ($node->[$#$node] == $pointset[$index]->[0]) {
                # Add to end of $node
                pop(@$node);      # Avoid point at $node->[0] dup'd
                @$node = (@$node, @{$pointset[$index]});
                # Again, this splice could be used directly.
                splice(@pointset, $index, 1);
                unshift(@pointset, $node);
                next;    # Go back to the top of the while-loop
            }

            # If we reach this point, then none of the other
            # segments could attach to $node, so move it out
            # of the way.
            push(@final, $node);
        }
    }

    print scalar(@final) . " resulting graphs\n";
    print Data::Dumper::Dumper(\@final);

    This isn't as efficient as it could be, since it re-queues the graph and drops out after attaching just one other graph. You could traverse the entire list of sub-graphs, but after enough iterations, you will have the same results (disclaimer about duplicated endpoints notwithstanding).

    --rjray

      This is similar to some of my own attempts, and it does run after a few minor fixes (a missing ] bracket on the first if, and replacing 'continue' with 'next'), but this attempt has an unusual bug that I can't quite identify, which makes it impossible to call iteratively. The source data contains ten arrayrefs; if you run that array through this function, the return value successfully merges those ten sections into five, so it worked. But that isn't all the return data contains: it also includes thirty-one copies of those five elements. The second iteration will contain the same five merged elements, along with 625 copies, and the third iteration will have 198,130 copies. Since my actual source data may contain thousands of points, you can imagine this has a negative impact on memory consumption. :)

      As for your other concerns, the actual code does deal with floats in a reasonable manner, and I have other code in another part of the application that checks to make sure I don't have multiple segments with the same starting or ending point.

        I've tweaked the code a bit (the fixes you point out, and a change to keep the current node under consideration until such point as it gets sent to @final). But I don't think that my tweaks will solve the problems you describe. Have you created a test application with reasonable data that you could send to me to run myself? Something minimalist, like in the original node?

        If so, feel free to email it to me at "rjray at blackperl.com".

        --rjray

Re: AoA data merging
by Jaap (Curate) on Mar 10, 2003 at 20:49 UTC
    I do not know why you created such a complex data structure for the points (@data), but I would start by making that simpler:
    @data = (
        [
            [2289290.30567314, 507448.506452598],
            [2289282.72943502, 507410.155776398],
            [2289149,          507002.0625],
            [2289055.31527948, 506731.474946477],
        ],
        [
            [2289563.40820205, 512818.142015762],
            [..., ...],
        ],
    );
    Or, you could use Math::Complex (see Genetic Algorithms in Perl) to represent 2D points (vectors) and get predefined functions for working with them.
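    (A rough, untested sketch of that idea, treating each point as a complex number so that subtraction and abs() give you distances for free:)

    use Math::Complex;

    # X becomes the real part, Y the imaginary part
    my $p1 = cplx(2289290.30567314, 507448.506452598);
    my $p2 = cplx(2289282.72943502, 507410.155776398);

    my $dist = abs($p1 - $p2);    # Euclidean distance between the two points
    printf "distance: %.3f\n", $dist;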

      The reason it is so complex is that the Point class included here is only a minimalist version containing just the elements needed to demonstrate this issue; the objects are actually more complex Geo::ShapeFile::Point objects, with much more information than X and Y coordinates. I could have demonstrated this by including the real data, but I think people would frown upon multi-gigabyte nodes :)

Re: AoA data merging
by graff (Chancellor) on Mar 11, 2003 at 06:24 UTC
    You have a clear and fairly simple (though not trivial) criterion that applies to a portion of a very large and complex data structure. One approach would be to design a relational table schema to store the full data structure; this could have one table with a record for each "road chunk", another table with a record for each X,Y point referenced by a road chunk, and perhaps a third table that relates the chunk records to the point records. By structuring the data this way, it will be easier to identify and manipulate just the information you need to solve the joining problem (in fact, you probably want yet another table to hold the "chunk-chain" findings).

    In fact, just breaking the problem down into separate data structures (tables) this way might clarify what the algorithm needs to do, whether or not you actually end up using an RDBMS.

    It may be sufficient just to have the set of X,Y points as a unified data structure, with info on which road-chunk(s) each point belongs to; joining the road chunks is now just a matter of determining, for each point, how many road chunks contain it, and whether two (or more?) chunks have a given point as a terminus.

    In other words, you started with an array of chunks, with each chunk containing an array of points -- try making a different structure the other way around: an array of points, and each one cites one or more road-chunks that it is a part of. This would be easiest if the point array is actually a hash, keyed by the X,Y coordinates.

    my %pointdata;
    my @chunkends;

    for ( my $i = 0; $i < @data; $i++ )                # road-chunks
    {
        my $key;
        for ( my $j = 0; $j < @{$data[$i]}; $j++ )     # points in a chunk
        {
            $key = join( ",", $data[$i][$j]->{X}, $data[$i][$j]->{Y} );
            push @{$pointdata{$key}}, sprintf( "%5.5d:%5.5d", $i, $j );
            $chunkends[$i] = $key if ( $j == 0 );
        }
        $chunkends[$i] .= "-$key";
    }

    # Now the %pointdata hash contains all the information
    # about junctures between chunks (these are not necessarily
    # all end-point junctures -- perhaps some road chunks could
    # intersect at the mid-points?)  Also, the @chunkends array
    # cites the %pointdata keys for the endpoints of each chunk;
    # if you only want to find end-point junctures, identify the
    # %pointdata keys with multi-element entries and grep for
    # each one among the strings in @chunkends.
    None of that is tested, but I hope it might be helpful.

      In other words, you started with an array of chunks, with each chunk containing an array of points -- try making a different structure the other way around: an array of points, and each one cites one or more road-chunks that it is a part of. This would be easiest if the point array is actually a hash, keyed by the X,Y coordinates.

      Although this is a good idea, the problem with this approach is that order matters; if the points of the arrayrefs get rearranged, it can change the results. It does suggest an alternate approach that I will have to play with, though: if I were to put just the ends of each chunk into a simpler data structure, I could order them using only two points per chunk, and then it would be pretty easy to identify which chunks those end-points are associated with. It's likely that would eliminate some of the problems I've been having with my earlier approach.

      Update: In a node I hadn't read yet, someone else had already thought of this approach, and posted code. I love perlmonks! Thanks for all the help guys, I have several things to try out now based on all your suggestions.

Re: AoA data merging
by zengargoyle (Deacon) on Mar 11, 2003 at 06:47 UTC

    Look at the Graph modules on CPAN. You can create a node for each point X,Y and add an edge to the graph for each segment in your dataset. The Graph module can then give you paths between chosen nodes (or anything else you might like to know).
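    (A rough, untested sketch of that suggestion, assuming the CPAN Graph distribution and using only the first and last point of each chunk as vertices:)

    use Graph::Undirected;

    my $g = Graph::Undirected->new;

    # one vertex per distinct endpoint, one edge per road chunk
    for my $chunk (@data) {
        my $from = join ',', $chunk->[0]{X},  $chunk->[0]{Y};
        my $to   = join ',', $chunk->[-1]{X}, $chunk->[-1]{Y};
        $g->add_edge($from, $to);
    }

    # each connected component is one contiguous stretch of road
    my @stretches = $g->connected_components;
    print scalar(@stretches), " separate stretch(es) of road\n";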

Re: AoA data merging
by Enlil (Parson) on Mar 11, 2003 at 10:06 UTC
    Here is another way to do it (I glanced through everything but BrowserUk's post, which he put up as I was finishing). It requires one pass through. I tested it with two sets of data, the first being your original data as posted, and the second something that I found a little easier to follow. The work is done by the following code:
    package Point;

    use strict;
    use warnings;
    use Data::Dumper;

    my @data = ( ... );    # see data_sets below

    my @new_data;
    while ( @data ) {
        my $item = shift @data;
        my $match;
        foreach ( 0 .. $#data ) {
            if (    $data[$_][0]{X} == @{$item}[-1]->{X}
                and $data[$_][0]{Y} == @{$item}[-1]->{Y} ) {
                $match = [ 'front', $_ ];
                last;
            }
            if (    $data[$_][-1]{X} == @{$item}[0]->{X}
                and $data[$_][-1]{Y} == @{$item}[0]->{Y} ) {
                $match = [ 'end', $_ ];
                last;
            }
        }
        if ( defined $match ) {
            if ( @{$match}[0] eq 'front' ) {
                splice @{$data[@{$match}[1]]}, 0, 1, @{$item};
            } else {
                splice @{$data[@{$match}[1]]}, -1, 1, @{$item};
            }
        } else {
            push @new_data, $item;
        }
    }

    print Dumper \@new_data;
    The data for run #1:

    and the results

    and with your data the results were:

    -enlil

Re: AoA data merging
by BrowserUk (Patriarch) on Mar 11, 2003 at 09:10 UTC

    Whether this is any easier I'm not sure, but it should be more efficient as the actual merge only requires one pass.

    The idea is that you create a single AoA containing just the start points and end points of your road sections. You then sort the end points into reverse order. You can then process the sorted points in adjacent pairs (0&1,1&2,2&3 etc), and if the two are the same (within some definition of same given this is FP data), then you have found two sections that can be joined.

    It requires that the AoA elements contain not only the x,y pair, but also a reference back to the section that they are a start or end point of, and a flag indicating which:

    Given your array @data, you might use something like

    my @ends = map{ [$_->[0], $_, 'S'], [$_->[-1], $_, 'E'] } @data;

    to create the array of endpoints with references back to their parent sections, and a flag to differentiate between Start points and End points, chosen so that Starts will sort lexically before Ends (in reverse order).

    You then sort that using something like

    @ends = sort{ $b->[0]{X} <=> $a->[0]{X} || $b->[0]{Y} <=> $a->[0]{Y} || $b->[2] cmp $a->[2] } @ends;

    Depending upon the accuracy of your data, you might need to introduce some fudge factor into the comparisons, as is common with FP data, though this is probably best saved for later, when you come to determining whether two adjacent points are the same.

    Then the process of comparing and joining sections only requires a single pass. You might find this function useful for that.

    sub mapAdjPairs (&@) {
        my $code = shift;
        map {
            local ($a, $b) = (shift, $_[0]);
            $code->();
        } 0 .. @_ - 2;
    }
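    (To see what it does: it calls the block once for each adjacent pair of the argument list, with the pair in $a and $b. A quick example:)

    my @sums = mapAdjPairs { $a + $b } 1, 2, 3, 4;
    print "@sums\n";    # prints "3 5 7"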

    Having created @ends and sorted it as above, your merge process becomes something like

    my @merged = mapAdjPairs {
        if ($a->[2] eq 'S') {
            # $a is a start point
            if (    $b->[2] eq 'E'
                and $a->[0]{X} == $b->[0]{X}    # Introduce FP fudge factor if needed
                and $a->[0]{Y} == $b->[0]{Y} )
            {
                # $b is an end and we found a coincident end/start pair
                push @{ $b->[1] }, @{ $a->[1] }[ 1 .. $#{ $a->[1] } ];
                return;    # we merged two sections so don't add to result
            }
            else {
                # We have a start point and didn't merge so add to the results
                return $a->[1];
            }
        }
        return;
    } @ends, $ends[0];    # Note: Repeat first element last to complete cycle

    Combining all of that together produces the following output:

    c:\test>240810
    X 11 Y 12 X 12 Y 13 X 13 Y 14 X 14 Y 15 X 15 Y 16 X 16 Y 17 X 17 Y 18 X 18 Y 19 X 19 Y 20
    X 10 Y 10 X 11 Y 11 X 12 Y 12 X 13 Y 13 X 14 Y 14 X 15 Y 15 X 16 Y 16 X 17 Y 17 X 18 Y 18 X 19 Y 19 X 20 Y 20 X 21 Y 21 X 22 Y 22 X 23 Y 23 X 24 Y 24
    X 1 Y 2 X 1 Y 3 X 1 Y 4

    c:\test>

    The process will work just as well for FP data, though you might need to fudge the comparisons as is quite usual.

    The complete POC code and output


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: AoA data merging (final answer)
by jasonk (Parson) on Mar 11, 2003 at 23:46 UTC

    Based on the suggestions offered here and some discussions with other programmers, this is the working code I came up with, for anyone who is interested. Thanks for all the great suggestions!

    sub merge {
        my @allparts = @_;

        # make an array of just the end points
        my @ends = map { [ $_->[0], $_->[-1] ] } @allparts;

        # give us a starting point
        my @sorted = (shift(@ends));

        # while we still have unsorted ends
        OUTER: while (@ends) {
            # check each of the unsorted ends to see if they can
            # attach, we only compare them to the last element of
            # @sorted because if we have more than one element in
            # @sorted, then we already checked all of @ends against
            # the elements at the beginning
            for my $x (0 .. $#ends) {
                # see if we can attach it at the beginning
                if ($sorted[-1]->[0] == $ends[$x]->[1]) {
                    unshift(@{$sorted[-1]}, @{splice(@ends, $x, 1)});
                    next OUTER;
                }
                # see if we can attach it at the end
                if ($ends[$x]->[0] == $sorted[-1]->[-1]) {
                    push(@{$sorted[-1]}, @{splice(@ends, $x, 1)});
                    next OUTER;
                }
            }
            # if we get here, then we didn't find
            # any @ends that attach to anything
            # in @sorted, so we grab one of the
            # @ends to make a new section, and
            # continue
            push(@sorted, shift(@ends));
        }

        # now go through the sections in @sorted, and replace
        # them with the appropriate complete data from @allparts.
        for my $x (0 .. $#sorted) {
            my @ends = @{$sorted[$x]};
            my @full = ();
            # for each start/end combination
            while (my($start, $end) = splice(@ends, 0, 2)) {
                # find the chunk it came from
                #print "start=$start end=$end\n";
                foreach (@allparts) {
                    #print "0=$_->[0]\n";
                    #print Dumper($_);
                    if (($_->[0] == $start) && ($_->[-1] == $end)) {
                        pop(@full);    # get rid of the duplicate piece
                        push(@full, @{$_});
                        last;
                    }
                }
            }
            $sorted[$x] = [@full];
        }
        return @sorted;
    }