Determine group membership for partial sequences

GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

Background

The code below is supposed to process a group of partial related sequences to find the number of independent groups represented by the data. The original data was generated during an archaeological excavation. As the excavation progressed items were found to overlay other items and the relationship between items was recorded. This resulted in a large number of partial sequences.

The stimulus

The first task was to take the partial sequences and build longer chains then generate a graph to show the relationship between items. That has been done and shows a number of independent groupings of items.

The task now is to validate the groupings shown on the graph. The code below attempts to accomplish that and almost succeeds.

The problem

The problem is that the group beginning 23 in the output should be part of the large group beginning 6. That can be checked by noting in the original data that 52 overlays 99 and in the output 52 is in the large group and 99 is in the smaller group.

use strict;
use warnings;

my %overlays;

open my $DATA, '<', 'PAIRS2Clean.DAT';
while (<$DATA>) {
    chomp;
    my @next = grep $_, split;
    
    while (@next > 1) {
        my $top = shift @next;
        
        $overlays{$top}{$next[0]} = 1;
    }
}
close $DATA;

my @chains = buildChains (\%overlays);
my %assdTops; # Top (group id) associated with a node

# Run through chains and build groupings
for my $chain (@chains) {
    my $top = shift @$chain; # Use first item in chain as group id
    my @priorTops = map {$assdTops{$_}} grep {exists $assdTops{$_}} @$
+chain;
    
    if (@priorTops) {
        # Found groups that can be amalgamated
        my %seen;
        
        # remove duplicates
        @priorTops = grep {! $seen{$_}++} @priorTops;
        
        # make sure old id node is included in super group
        $assdTops{$top} = $priorTops[0] unless $priorTops[0] == $top;
        
        # First item becomes super group id
        $top = shift @priorTops;

        # Merge nodes into super group
        for my $oldTop (@priorTops) {
            $assdTops{$_} = $top for
                grep {$assdTops{$_} == $oldTop} keys %assdTops;
        }
    }
    
    $assdTops{$_} = $top for @$chain;
}

my $done;

# Find group ids that belong to another group
until ($done) {
    $done = 1;

    for my $key (sort keys %assdTops) {
        next unless exists $assdTops{$assdTops{$key}} &&
            $assdTops{$assdTops{$key}} != $assdTops{$key};
        
        # Found nested node. Move it.
        $assdTops{$key} = $assdTops{$assdTops{$key}};
        $done = 0;
    }
}

my %groups;

# Build hash of group ids and group members
$groups{$assdTops{$_}}{$_}++ > 1 and die "$_ is in more than one group
+"
    for keys %assdTops;

# print groups
for my $group (sort {$a <=> $b} keys %groups) {
    print join (', ', $group, sort {$a <=> $b} keys %{$groups{$group}}
+), "\n\n";
}


sub buildChains {
    # Build chains
    my $overs = shift;
    my @chains;
    
    for my $top (sort {$a <=> $b} keys %$overs) {
        if (! exists $overlays{$top}) {
            push @chains, [$top];
            next;
        }
        
        my @subChains = buildChains ($overlays{$top});
        push @chains, [$top, @$_] for @subChains;
    }

    return @chains;
}
[download]

Prints:

4, 42, 51, 76, 85, 86, 133, 666

6, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25,
+ 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 39, 40, 44, 45, 46, 
+48, 49, 50, 52, 53, 54, 56, 57, 58, 61, 65, 66, 68, 72, 73, 80, 82, 8
+4, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 100, 101, 102, 103, 10
+4, 106, 107, 108, 109, 110, 111, 114, 117, 119, 120, 121, 122, 123, 1
+24, 125, 126, 127, 129, 130, 131, 132, 134, 136, 137, 139, 140, 141, 
+145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 158, 159, 160,
+ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174
+, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 18
+8, 189, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 777, 8
+88, 999

23, 24, 43, 47, 74, 75, 98, 99, 112, 113, 116

34, 63

60, 64, 67

62, 81, 83, 115

71, 59

78, 69

105, 118
[download]

the data set

  1     27
  1     32
  1     68
  4     42
  4    666
  5     22
  5    129
 12      6     84
 12      6    106
  8     45
  8     19
  9     10
  9     33
  9     10     65
  9     10     72
  9     10     95
 27     11    102
 27     11    107
 27     11    141
 27     11    146
 12     54
  8     13     88
  8     13     89
  8     13     90
 14     29
 15     40
 15     44
 16     57
 17     30
 17    110
 18     37
  8     19     36
  8     19     90
 21     20    187
 21     20
 21     48
 14     22     29
 14     22     40
 23     47
 24     74
 24     75
 25    102
 26     91
 27     11
 28     92
 14     29     84
 31     30     48
 31     30     50
 31     30     93
 31    104
  1     32     68
  1     32     91
  1     32    103
  9     33     53
 34     63
 35     38
 39     36     96
 39     36    121
 37     36     96
 37     36    121
 18     37     66
 18     37     96
 18     37    121
 35     38     73
 35     38     94
 39     36
 14     40     54
 15     40     54
 22     40     54
 29     40     54
 51    666     86
 51     42     86
  4    666     86
  4     42     86
 43     74
 56     44    137
 15     44    137
  8     45     95
  8     46     19
 23     47     99
 23     47    116
 20     48    150
 49     48    150
 21     48    150
 31     48    150
 21     49     48
 31     49     48
 30     50    104
 51    666
 52     43
 52     99
 52    114
 52    116
 52    120
 52    131
 33     53     117
 33     53     125
 40     54     122
 40     54     111
 40     54     136
 40     54     123
 15     56      57
 15     56      44
 16     56      57
 16     56      44
 16     57
 58    110
 71     59
 60     64
  6     61      84
 62     83
 34     63
 60     64
  8     65      89
  8     65      95
  9     65      89
  9     65      95
 45     65      89
 45     65      95
 37     66     153
 67     64
 32     68     103
 71     59
 10     72      95
 10     72     101
 38     73      94
 38     73     109
 40     73      94
 40     73     109
 43     74
 24     74
 24     75     112
 85     76
 78     69
 54     80     130
 81    115
 82     73
 62     83
 61     84     106
 29     84     106
 42     85      76
 42     86      85
 87     88
 87    145
 13     88      90
 13     88     145
 87     88      90
 87     88     145
 13     89      90
 65     89      90
 13     90     146
 19     90     146
103     91     109
 28     92     151
 30     93     126
 73     94     108
 73     94     148
 29     94     108
 29     94     148
 35     94     108
 35     94     148
 72     95      89
 72     95     140
 45     95      89
 45     95     140
 10     95      89
 10     95     140
 65     95      89
 65     95     140
 36     96     184
 37     96     184
 97    166
 97    200
113     98      75
 47     99     116
 30    100     188
 72    101
107    102     119
 25    102     119
 11    102     119
134    102     119
103     91
103    109
 32    103      91
 32    103     109
 50    104     150
 50    104     163
 50    104     167
 48    104     150
 48    104     163
 48    104     167
105    118
 84    106     148
 11    107     119
 11    107     102
 94    108     109
 94    108     152
 73    108     109
 73    108     152
108    109     152
108    109     160
 91    109     152
 91    109     160
 17    110     126
 12    111     122
 12   111     177
 54    111     122
 54    111     177
 98    112
113     75
113     98
 81    115
 62    115
 47    116      74
 99    116      74
 53    117
102    119
107    119
 40    120     131
 36    121     161
 36    121     152
 37    121     161
 37    121     152
111    122     177
111    122     178
 54    123     177
106    124     148
 53    125     117
110    126     166
101    127
103    129      80
 80    130
 12    131     149
120    131     149
132    151
 51    133
 25    134     102
 44    137
 91    139
 32    139
 89    140     192
 89    140     165
 11    141     184
 11    141     183
 88    145     193
 89    145     193
 90    146     165
  6    147     155
  6    147     159
 94    148     149
 94    148     180
 94    148     169
106    148     149
106    148     180
106    148     169
148    149     169
131    149     169
 73    149     169
104    150     188
104    150     163
110    151     167
 92    151     167
155    151     167
 73    152     888
 73    152     164
 73    152     183
 73    152     999
 73    152     181
153    152     888
153    152     164
153    152     183
153    152     999
153    152     181
 66    153     121
 66    153     152
 66    153     164
 37    153     121
 37    153     152
 37    153     164
 91    154     183
147    155     151
147    155     176
147    155     189
159    155     151
159    155     176
159    155     189
110    155     151
110    155     176
110    155     189
162    158     167
104    158     167
147    159     155
147    159     172
109    160     185
109    160     999
109    160     183
121    161     182
104    162     158
104    162     163
162    163     167
150    163     167
104    163     167
153    164     181
140    165     193
140    165     194
 89    165     193
 89    165     194
 90    165     193
 90    165     194
126    166     200
126    166     179
162    167
163    167
150    167
158    167
151    167
169    168     185
169    168     186
169    168     999
148    169     168
148    169     999
149    169     168
149    169     999
171    170     181
171    170     999
152    170     181
152    170     999
888    170     181
888    170     999
 35    171     170
 38    171     170
172    176
147    173
 88    174     195
 39    174     195
 45    175
172    176
155    176
147    176
151    177     178
106    177     178
185    178
186    178
166    179     189
126    179     189
999    180     197
148    180     197
 94    180     197
170    181
164    181
152    181
194    182     196
184    182     196
161    182     196
146    182     196
152    183     184
152    183     999
152    183     198
191    183     184
191    183     999
191    183     198
141    184     194
160    185     178
160    185     186
160    185     999
168    185     178
168    185     186
168    185     999
168    186     178
168    186     999
 20    187     188
187    188     189
150    188     189
100    188     189
188    189
 27    191     183
193    192     201
194    192     201
165    193     192
184    194     182
184    194     192
165    194     182
165    194     192
174    195
 39    195
182    196     777
180    197     777
183    198     777
 97    199
199    200     777
 97    200     777
166    200     777
192    201     777
[download]

A further (but lesser) issue is that the code will be published along with the results as a demonstration that the groupings (at least) presented by the graph are valid. As it stands the code is likely too magical to be meaningful to the vast majority of readers of the archaeological paper in which these results are likely to be published. Are there any tricks I've missed in massaging the data that would allow simplifying the code for easier understanding? Should I use less Perl magic to get the job done (I've been told by the primary author that it doesn't matter)?

DWIM is Perl's answer to Gödel

Comment on Determine group membership for partial sequences Select or Download Code

Replies are listed 'Best First'.
Re: Determine group membership for partial sequences by Anno (Deacon) on Aug 23, 2007 at 22:09 UTC
It appears that what you are after is a relation among the nodes such that two nodes are related if one overlays the other, directly or indirectly (or if the nodes are identical). Direct overlay is what is stated in the data. The indirect relations can be described as the transitive closure of the relation of immediate overlays. The desired chains would be the equivalence classes of this extended relationship. Fortunately, there is a module (by Abigail) that implements transitive closure. It is the work-horse of the following solution: #!/usr/local/bin/perl use strict; use warnings; $\| = 1; use Vi::QuickFix; use Algorithm::Graphs::TransitiveClosure qw( floyd_warshall); my %graph; my %nodes; while (<DATA>) { my @next = split; @nodes{ @next} = (); while ( @next > 1 ) { # build refexive and symmetric relation my $top = shift @next; my $below = $next[ 0]; $graph{ $top}->{ $top} = 1; $graph{ $top}->{ $below} = 1; $graph{ $below}->{ $top} = 1; } } # build transitive closure floyd_warshall( \ %graph); # pull off chains my @chains; while ( keys %nodes ) { my $node = each %nodes; my @chain = sort { $a <=> $b } keys %{ $graph{ $node} }; push @chains, \ @chain; delete @nodes{ @chain}; } print "@$_\n" for sort { $a->[ 0] <=> $b->[ 0] } @chains; exit; __DATA__ 1 27 1 32 1 68 ... [download] The `%graph` hash is built much like `%overlays` in your code, except that both ascending and descending pairs are considered related. After the Floyd-Marshall algortithm has built the closure, the groups can be pulled off one by one. The resulting output 1 5 6 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 +30 31 32 33 35 36 37 38 39 40 43 44 45 46 47 48 49 50 52 53 54 56 57 +58 61 65 66 68 72 73 74 75 80 82 84 87 88 89 90 91 92 93 94 95 96 97 +98 99 100 101 102 103 104 106 107 108 109 110 111 112 113 114 116 117 + 119 120 121 122 123 124 125 126 127 129 130 131 132 134 136 137 139 +140 141 145 146 147 148 149 150 151 152 153 154 155 158 159 160 161 1 +62 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 17 +9 180 181 182 183 184 185 186 187 188 189 191 192 193 194 195 196 197 + 198 199 200 201 777 888 999 4 42 51 76 85 86 133 666 34 63 59 71 60 64 67 62 81 83 115 69 78 105 118 [download] shows indeed only one large group which contains the spurious one that appeared in your output. Update: Come to think of it, what you're doing is determine the connected components of a graph. There are algorithms (and CPAN modules) that implement that directly. Using Floyd-Warshall in this way is possible, but probably less than optimal. Anno	[reply] [d/l] [select]
Re^2: Determine group membership for partial sequences by GrandFather (Saint) on Aug 23, 2007 at 23:13 UTC
Yay! Exactly the sort of clean up I was looking for. Thank you. This is a one off task so the fact that there may be more efficient ways of achieving the result than using Floyd-Warshall doesn't really matter. DWIM is Perl's answer to Gödel	[reply]
Re^3: Determine group membership for partial sequences by Anno (Deacon) on Aug 24, 2007 at 08:23 UTC
Describing the result as connected components of a graph may be a way to explain the process in intuitive terms. Anno	[reply]
Re: Determine group membership for partial sequences by Limbic~Region (Chancellor) on Aug 23, 2007 at 14:10 UTC
GrandFather, I have been eyeing your node since you posted it. It seems to me that it is unlikely that there is anything you could do to your code to make it more meaningful to the audience. Even if they were non-perl programmers, the following would require a fair amount of explanation: `my @next = grep $_, split;` [download] Implicit variables like $_ What is truth Scoping and declaration Default parameters to functions I think you would be better off by providing a plain english explanation of the algorithm. That way they can understand and agree with your approach even if they don't understand the code. I am not talking about comments nor even documentation. I think a completely distinct written explanation would be of the most benefit to this audience. Cheers - L~R	[reply] [d/l]
Re^2: Determine group membership for partial sequences by GrandFather (Saint) on Aug 23, 2007 at 18:42 UTC
I agree. I suspect however that the primary author of the paper will tell me that it suffices that the code is available and anyone who is interested has all the information they need to "perform the experiment" themselves. Which is right in line with the laziness virtue. ;) DWIM is Perl's answer to Gödel	[reply]
Re^3: Determine group membership for partial sequences by Limbic~Region (Chancellor) on Aug 23, 2007 at 22:14 UTC
GrandFather, I'm sorry, I meant to include a line that says "For more information, see <link>" because I agree that the explanation of the algorithm should take a back seat in the article. Trouble of posting at PerlMonks when you are supposed to be working. Cheers - L~R	[reply]
Re: Determine group membership for partial sequences by BrowserUk (Patriarch) on Aug 23, 2007 at 19:47 UTC
A written description of what the data represents might help us understand the data and perhaps offer a solution to your described problem--assuming one exists. The best interpretation I could muster from your description is that the data lines: `182 196 777 180 197 777` [download] Means that artifact 182 was found in a higher(or maybe lower, I couldn't decide?) strata than artifact 196,which in turn was in a higher strata than artifact 777. And that a similar relationship exists between the artifacts in the second line. But although artifacts 196 and 197 where definitely both higher than artifact 777, it does not mean they were both at the same level, nor illustrate anything about their relative depths. Same goes for artifacts 182 and 180. So from those two lines alone, any of the following (plus many other), relative positionings is possible: `182 180 182 180 182 180 197 196 196 197 196 182 180 197 196 197 777 777 777 777` [download] I first thought the purpose of the code was to try and 'sort' the artifacts from top to bottom be resolving all the relative positionings, but the form of the output doesn't seem to bear that out. If that were the case, it would appear to put far too many artifacts at one level? My next thought was that the data indicated a physical overlap in the x/y plane. Ie. That some part of artifact 196 physically overlays some part of artifact 777. And the same for 197. And that the purpose was to say that all these artifacts were in one part of the dig, (one spatial grouping), and these were grouped in another part of the dig. And there was no overlap between any of the items within those two (or more) groupings. That seems more likely, but it would be easier to understand if you said up front what the data actually means. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]