GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

Background

The code below is supposed to process a group of partial related sequences to find the number of independent groups represented by the data. The original data was generated during an archaeological excavation. As the excavation progressed items were found to overlay other items and the relationship between items was recorded. This resulted in a large number of partial sequences.

The stimulus

The first task was to take the partial sequences and build longer chains then generate a graph to show the relationship between items. That has been done and shows a number of independent groupings of items.

The task now is to validate the groupings shown on the graph. The code below attempts to accomplish that and almost succeeds.

The problem

The problem is that the group beginning 23 in the output should be part of the large group beginning 6. That can be checked by noting in the original data that 52 overlays 99 and in the output 52 is in the large group and 99 is in the smaller group.

use strict; use warnings; my %overlays; open my $DATA, '<', 'PAIRS2Clean.DAT'; while (<$DATA>) { chomp; my @next = grep $_, split; while (@next > 1) { my $top = shift @next; $overlays{$top}{$next[0]} = 1; } } close $DATA; my @chains = buildChains (\%overlays); my %assdTops; # Top (group id) associated with a node # Run through chains and build groupings for my $chain (@chains) { my $top = shift @$chain; # Use first item in chain as group id my @priorTops = map {$assdTops{$_}} grep {exists $assdTops{$_}} @$ +chain; if (@priorTops) { # Found groups that can be amalgamated my %seen; # remove duplicates @priorTops = grep {! $seen{$_}++} @priorTops; # make sure old id node is included in super group $assdTops{$top} = $priorTops[0] unless $priorTops[0] == $top; # First item becomes super group id $top = shift @priorTops; # Merge nodes into super group for my $oldTop (@priorTops) { $assdTops{$_} = $top for grep {$assdTops{$_} == $oldTop} keys %assdTops; } } $assdTops{$_} = $top for @$chain; } my $done; # Find group ids that belong to another group until ($done) { $done = 1; for my $key (sort keys %assdTops) { next unless exists $assdTops{$assdTops{$key}} && $assdTops{$assdTops{$key}} != $assdTops{$key}; # Found nested node. Move it. $assdTops{$key} = $assdTops{$assdTops{$key}}; $done = 0; } } my %groups; # Build hash of group ids and group members $groups{$assdTops{$_}}{$_}++ > 1 and die "$_ is in more than one group +" for keys %assdTops; # print groups for my $group (sort {$a <=> $b} keys %groups) { print join (', ', $group, sort {$a <=> $b} keys %{$groups{$group}} +), "\n\n"; } sub buildChains { # Build chains my $overs = shift; my @chains; for my $top (sort {$a <=> $b} keys %$overs) { if (! exists $overlays{$top}) { push @chains, [$top]; next; } my @subChains = buildChains ($overlays{$top}); push @chains, [$top, @$_] for @subChains; } return @chains; }

Prints:

4, 42, 51, 76, 85, 86, 133, 666 6, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, + 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 39, 40, 44, 45, 46, +48, 49, 50, 52, 53, 54, 56, 57, 58, 61, 65, 66, 68, 72, 73, 80, 82, 8 +4, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 100, 101, 102, 103, 10 +4, 106, 107, 108, 109, 110, 111, 114, 117, 119, 120, 121, 122, 123, 1 +24, 125, 126, 127, 129, 130, 131, 132, 134, 136, 137, 139, 140, 141, +145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 158, 159, 160, + 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174 +, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 18 +8, 189, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 777, 8 +88, 999 23, 24, 43, 47, 74, 75, 98, 99, 112, 113, 116 34, 63 60, 64, 67 62, 81, 83, 115 71, 59 78, 69 105, 118

the data set

1 27 1 32 1 68 4 42 4 666 5 22 5 129 12 6 84 12 6 106 8 45 8 19 9 10 9 33 9 10 65 9 10 72 9 10 95 27 11 102 27 11 107 27 11 141 27 11 146 12 54 8 13 88 8 13 89 8 13 90 14 29 15 40 15 44 16 57 17 30 17 110 18 37 8 19 36 8 19 90 21 20 187 21 20 21 48 14 22 29 14 22 40 23 47 24 74 24 75 25 102 26 91 27 11 28 92 14 29 84 31 30 48 31 30 50 31 30 93 31 104 1 32 68 1 32 91 1 32 103 9 33 53 34 63 35 38 39 36 96 39 36 121 37 36 96 37 36 121 18 37 66 18 37 96 18 37 121 35 38 73 35 38 94 39 36 14 40 54 15 40 54 22 40 54 29 40 54 51 666 86 51 42 86 4 666 86 4 42 86 43 74 56 44 137 15 44 137 8 45 95 8 46 19 23 47 99 23 47 116 20 48 150 49 48 150 21 48 150 31 48 150 21 49 48 31 49 48 30 50 104 51 666 52 43 52 99 52 114 52 116 52 120 52 131 33 53 117 33 53 125 40 54 122 40 54 111 40 54 136 40 54 123 15 56 57 15 56 44 16 56 57 16 56 44 16 57 58 110 71 59 60 64 6 61 84 62 83 34 63 60 64 8 65 89 8 65 95 9 65 89 9 65 95 45 65 89 45 65 95 37 66 153 67 64 32 68 103 71 59 10 72 95 10 72 101 38 73 94 38 73 109 40 73 94 40 73 109 43 74 24 74 24 75 112 85 76 78 69 54 80 130 81 115 82 73 62 83 61 84 106 29 84 106 42 85 76 42 86 85 87 88 87 145 13 88 90 13 88 145 87 88 90 87 88 145 13 89 90 65 89 90 13 90 146 19 90 146 103 91 109 28 92 151 30 93 126 73 94 108 73 94 148 29 94 108 29 94 148 35 94 108 35 94 148 72 95 89 72 95 140 45 95 89 45 95 140 10 95 89 10 95 140 65 95 89 65 95 140 36 96 184 37 96 184 97 166 97 200 113 98 75 47 99 116 30 100 188 72 101 107 102 119 25 102 119 11 102 119 134 102 119 103 91 103 109 32 103 91 32 103 109 50 104 150 50 104 163 50 104 167 48 104 150 48 104 163 48 104 167 105 118 84 106 148 11 107 119 11 107 102 94 108 109 94 108 152 73 108 109 73 108 152 108 109 152 108 109 160 91 109 152 91 109 160 17 110 126 12 111 122 12 111 177 54 111 122 54 111 177 98 112 113 75 113 98 81 115 62 115 47 116 74 99 116 74 53 117 102 119 107 119 40 120 131 36 121 161 36 121 152 37 121 161 37 121 152 111 122 177 111 122 178 54 123 177 106 124 148 53 125 117 110 126 166 101 127 103 129 80 80 130 12 131 149 120 131 149 132 151 51 133 25 134 102 44 137 91 139 32 139 89 140 192 89 140 165 11 141 184 11 141 183 88 145 193 89 145 193 90 146 165 6 147 155 6 147 159 94 148 149 94 148 180 94 148 169 106 148 149 106 148 180 106 148 169 148 149 169 131 149 169 73 149 169 104 150 188 104 150 163 110 151 167 92 151 167 155 151 167 73 152 888 73 152 164 73 152 183 73 152 999 73 152 181 153 152 888 153 152 164 153 152 183 153 152 999 153 152 181 66 153 121 66 153 152 66 153 164 37 153 121 37 153 152 37 153 164 91 154 183 147 155 151 147 155 176 147 155 189 159 155 151 159 155 176 159 155 189 110 155 151 110 155 176 110 155 189 162 158 167 104 158 167 147 159 155 147 159 172 109 160 185 109 160 999 109 160 183 121 161 182 104 162 158 104 162 163 162 163 167 150 163 167 104 163 167 153 164 181 140 165 193 140 165 194 89 165 193 89 165 194 90 165 193 90 165 194 126 166 200 126 166 179 162 167 163 167 150 167 158 167 151 167 169 168 185 169 168 186 169 168 999 148 169 168 148 169 999 149 169 168 149 169 999 171 170 181 171 170 999 152 170 181 152 170 999 888 170 181 888 170 999 35 171 170 38 171 170 172 176 147 173 88 174 195 39 174 195 45 175 172 176 155 176 147 176 151 177 178 106 177 178 185 178 186 178 166 179 189 126 179 189 999 180 197 148 180 197 94 180 197 170 181 164 181 152 181 194 182 196 184 182 196 161 182 196 146 182 196 152 183 184 152 183 999 152 183 198 191 183 184 191 183 999 191 183 198 141 184 194 160 185 178 160 185 186 160 185 999 168 185 178 168 185 186 168 185 999 168 186 178 168 186 999 20 187 188 187 188 189 150 188 189 100 188 189 188 189 27 191 183 193 192 201 194 192 201 165 193 192 184 194 182 184 194 192 165 194 182 165 194 192 174 195 39 195 182 196 777 180 197 777 183 198 777 97 199 199 200 777 97 200 777 166 200 777 192 201 777

A further (but lesser) issue is that the code will be published along with the results as a demonstration that the groupings (at least) presented by the graph are valid. As it stands the code is likely too magical to be meaningful to the vast majority of readers of the archaeological paper in which these results are likely to be published. Are there any tricks I've missed in massaging the data that would allow simplifying the code for easier understanding? Should I use less Perl magic to get the job done (I've been told by the primary author that it doesn't matter)?


DWIM is Perl's answer to Gödel

Replies are listed 'Best First'.
Re: Determine group membership for partial sequences
by Anno (Deacon) on Aug 23, 2007 at 22:09 UTC
    It appears that what you are after is a relation among the nodes such that two nodes are related if one overlays the other, directly or indirectly (or if the nodes are identical). Direct overlay is what is stated in the data. The indirect relations can be described as the transitive closure of the relation of immediate overlays. The desired chains would be the equivalence classes of this extended relationship.

    Fortunately, there is a module (by Abigail) that implements transitive closure. It is the work-horse of the following solution:

    #!/usr/local/bin/perl use strict; use warnings; $| = 1; use Vi::QuickFix; use Algorithm::Graphs::TransitiveClosure qw( floyd_warshall); my %graph; my %nodes; while (<DATA>) { my @next = split; @nodes{ @next} = (); while ( @next > 1 ) { # build refexive and symmetric relation my $top = shift @next; my $below = $next[ 0]; $graph{ $top}->{ $top} = 1; $graph{ $top}->{ $below} = 1; $graph{ $below}->{ $top} = 1; } } # build transitive closure floyd_warshall( \ %graph); # pull off chains my @chains; while ( keys %nodes ) { my $node = each %nodes; my @chain = sort { $a <=> $b } keys %{ $graph{ $node} }; push @chains, \ @chain; delete @nodes{ @chain}; } print "@$_\n" for sort { $a->[ 0] <=> $b->[ 0] } @chains; exit; __DATA__ 1 27 1 32 1 68 ...
    The %graph hash is built much like %overlays in your code, except that both ascending and descending pairs are considered related. After the Floyd-Marshall algortithm has built the closure, the groups can be pulled off one by one.

    The resulting output

    1 5 6 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 +30 31 32 33 35 36 37 38 39 40 43 44 45 46 47 48 49 50 52 53 54 56 57 +58 61 65 66 68 72 73 74 75 80 82 84 87 88 89 90 91 92 93 94 95 96 97 +98 99 100 101 102 103 104 106 107 108 109 110 111 112 113 114 116 117 + 119 120 121 122 123 124 125 126 127 129 130 131 132 134 136 137 139 +140 141 145 146 147 148 149 150 151 152 153 154 155 158 159 160 161 1 +62 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 17 +9 180 181 182 183 184 185 186 187 188 189 191 192 193 194 195 196 197 + 198 199 200 201 777 888 999 4 42 51 76 85 86 133 666 34 63 59 71 60 64 67 62 81 83 115 69 78 105 118
    shows indeed only one large group which contains the spurious one that appeared in your output.

    Update: Come to think of it, what you're doing is determine the connected components of a graph. There are algorithms (and CPAN modules) that implement that directly. Using Floyd-Warshall in this way is possible, but probably less than optimal.

    Anno

      Yay! Exactly the sort of clean up I was looking for. Thank you.

      This is a one off task so the fact that there may be more efficient ways of achieving the result than using Floyd-Warshall doesn't really matter.


      DWIM is Perl's answer to Gödel
        Describing the result as connected components of a graph may be a way to explain the process in intuitive terms.

        Anno

Re: Determine group membership for partial sequences
by Limbic~Region (Chancellor) on Aug 23, 2007 at 14:10 UTC
    GrandFather,
    I have been eyeing your node since you posted it. It seems to me that it is unlikely that there is anything you could do to your code to make it more meaningful to the audience. Even if they were non-perl programmers, the following would require a fair amount of explanation:
    my @next = grep $_, split;
    • Implicit variables like $_
    • What is truth
    • Scoping and declaration
    • Default parameters to functions

    I think you would be better off by providing a plain english explanation of the algorithm. That way they can understand and agree with your approach even if they don't understand the code. I am not talking about comments nor even documentation. I think a completely distinct written explanation would be of the most benefit to this audience.

    Cheers - L~R

      I agree. I suspect however that the primary author of the paper will tell me that it suffices that the code is available and anyone who is interested has all the information they need to "perform the experiment" themselves. Which is right in line with the laziness virtue. ;)


      DWIM is Perl's answer to Gödel
        GrandFather,
        I'm sorry, I meant to include a line that says "For more information, see <link>" because I agree that the explanation of the algorithm should take a back seat in the article. Trouble of posting at PerlMonks when you are supposed to be working.

        Cheers - L~R

Re: Determine group membership for partial sequences
by BrowserUk (Patriarch) on Aug 23, 2007 at 19:47 UTC

    A written description of what the data represents might help us understand the data and perhaps offer a solution to your described problem--assuming one exists.

    The best interpretation I could muster from your description is that the data lines:

    182 196 777 180 197 777

    Means that artifact 182 was found in a higher(or maybe lower, I couldn't decide?) strata than artifact 196,which in turn was in a higher strata than artifact 777. And that a similar relationship exists between the artifacts in the second line. But although artifacts 196 and 197 where definitely both higher than artifact 777, it does not mean they were both at the same level, nor illustrate anything about their relative depths. Same goes for artifacts 182 and 180.

    So from those two lines alone, any of the following (plus many other), relative positionings is possible:

    182 180 182 180 182 180 197 196 196 197 196 182 180 197 196 197 777 777 777 777

    I first thought the purpose of the code was to try and 'sort' the artifacts from top to bottom be resolving all the relative positionings, but the form of the output doesn't seem to bear that out. If that were the case, it would appear to put far too many artifacts at one level?

    My next thought was that the data indicated a physical overlap in the x/y plane. Ie. That some part of artifact 196 physically overlays some part of artifact 777. And the same for 197.

    And that the purpose was to say that all these artifacts were in one part of the dig, (one spatial grouping), and these were grouped in another part of the dig. And there was no overlap between any of the items within those two (or more) groupings.

    That seems more likely, but it would be easier to understand if you said up front what the data actually means.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.