clustering pairs

sugar has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: clustering pairs by GrandFather (Saint) on Dec 01, 2008 at 04:04 UTC
This is a somewhat tricky problem and the code below may require some careful study to understand! You are quite right that hashes are the key, but even with that some thought is required. Note that the following code tries to keep the clusters in order of the first cluster entry found which complicates things a little more, but not much. use strict; use warnings; my %idLookup; my %firstLines; while (<DATA>) { chomp; next unless length; my $pair = $_; my ($first, $second) = map {s/.*\.//; $_} split ' '; push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; for my $id (@lineOrder) { next unless exists $firstLines{$id}; my %hits = ($id => undef); my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; my $list = $idLookup{$hit}; delete $hits{$hit}; next unless exists $firstLines{$hit}; delete $firstLines{$hit}; for my $entry (@$list) { my ($entryId, $pair) = @$entry; next unless exists $firstLines{$entryId}; push @cluster, $pair; $hits{$entryId} = 1; } } print "cluster", ++$clusterNum, "\n"; print join "\n", @cluster, "\n"; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888 [download] Prints: `cluster1 ID5141.C1665 ID5141.C2448 cluster2 ID5141.C1253 ID5144.C2039 ID5141.C2039 ID5142.C1221 ID5141.C1221 ID5144.C1956 ID5141.C1596 ID5144.C1956 ID5141.C1046 ID5141.C1596 cluster3 ID5141.C1906 ID5144.C2149 ID5141.C2149 ID5141.C2386 ID5141.C2386 ID5141.C4990 cluster4 ID5141.C5887 ID5141.C7685 ID5141.C7685 ID5141.C4888 cluster5 ID5141.C1005 ID5142.C2808` [download] Perl's payment curve coincides with its learning curve.	[reply] [d/l] [select]
Re^2: clustering pairs by sugar (Beadle) on Dec 01, 2008 at 05:09 UTC
genius :) still i am trying to understand your program. Its very complicated.if you dont mind can u include comments on how it works. so that, i can learn, understand and also be able to modify the program to add features if needed later. Thank you so much for your guidance. hashes can do wonders i guess. i should start learning them soon :)	[reply]
Re^3: clustering pairs by GrandFather (Saint) on Dec 01, 2008 at 08:38 UTC
I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl. # Always use strictures! use strict; use warnings; # Use nice names for variables. There is no such thing as saving time # using short variable names! However, don't use excessively long # names either. Naming things is an important part of programming! my %idLookup; my %firstLines; while (<DATA>) { chomp; # not chop! Remove line end from $_ next unless length; # Skip blank lines my $pair = $_; # Pull out the id strings my ($first, $second) = map {s/.*\.//; $_} split ' '; # idLookup is a hash of arrays. Push one id onto the array # accessed using the other id as the hash key push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; # Build a hash keyed by id containing line numbers ($. is current +line number $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } # Build a list of ids ordered by line number my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; # Generate clusters in order of first occurrence of ids for my $id (@lineOrder) { # next if already allocated id to a cluster next unless exists $firstLines{$id}; my %hits = ($id => undef); # hash of ids in current cluster to pro +cess my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; # Get an id to process my $list = $idLookup{$hit}; # Get list of other ids current on +e is matched with delete $hits{$hit}; # Remove current id from the hit list # next if already processed the id next unless exists $firstLines{$hit}; delete $firstLines{$hit}; # Remove current id to indicate it's + processed for my $entry (@$list) { # Add matched ids to cluster my ($entryId, $pair) = @$entry; # next (don't add) if processed already next unless exists $firstLines{$entryId}; push @cluster, $pair; # Add the original line to the clust +er $hits{$entryId} = 1; # Add the matched id as a hit } } # print the cluster print "cluster", ++$clusterNum, "\n"; # Header line print join "\n", @cluster, "\n"; # Original lines } [download] Perl's payment curve coincides with its learning curve.	[reply] [d/l]
Re^4: clustering pairs by sugar (Beadle) on Dec 03, 2008 at 01:55 UTC
Re^5: clustering pairs by GrandFather (Saint) on Dec 03, 2008 at 02:27 UTC
Some notes below your chosen depth have not been shown here
Re^2: clustering pairs by SuicideJunkie (Vicar) on Dec 01, 2008 at 16:46 UTC
`ID5144.C2039 ID5141.C2039 ID5142.C1221 my ($first, $second) = map {s/.*\.//; $_} split ' ';` [download] Are you deleting the first half of the ID? While the first column contains only "ID5141"'s the second column does contain unique prefixes that are probably important to the problem. `ID5144.C2039` is different from `ID5141.C2039`, for example.	[reply] [d/l] [select]
Re^3: clustering pairs by jdporter (Paladin) on Dec 01, 2008 at 19:44 UTC
Looking at what the OP considers to be valid clusters, it appears that only the second part of each ID (`C\d+`) is considered when determining whether two items are "equal"; the first part (`ID\d+`) is ignored. Of course, the entire item must be remembered for when it is output again.	[reply] [d/l] [select]
Re: clustering pairs by ig (Vicar) on Dec 01, 2008 at 05:25 UTC
I found a somewhat different solution... #!/usr/bin/perl use strict; use warnings; my %lookup_ids_by_line; my %lookup_lines_by_id; chomp (my @lines = <DATA>); foreach my $line (@lines) { my ($id1, $id2) = $line =~ m/\.(\S+)/g; push(@{$lookup_ids_by_line{$line}}, $id1, $id2); push(@{$lookup_lines_by_id{$id1}}, $line); push(@{$lookup_lines_by_id{$id2}}, $line); } my $cluster_number = 0; foreach my $line (@lines) { next unless(exists($lookup_ids_by_line{$line})); $cluster_number++; my @lines_in_cluster = ($line); my @ids_to_check = @{$lookup_ids_by_line{$line}}; delete($lookup_ids_by_line{$line}); while(@ids_to_check) { my $id = pop(@ids_to_check); next unless(exists($lookup_lines_by_id{$id})); foreach my $line (@{$lookup_lines_by_id{$id}}) { if(exists($lookup_ids_by_line{$line})) { push(@lines_in_cluster, $line); push(@ids_to_check, @{$lookup_ids_by_line{$line}}); delete($lookup_ids_by_line{$line}); } } delete($lookup_lines_by_id{$id}); } print "cluster number $cluster_number:\n\t" . join("\n\t",@lines_in_cluster) . "\n"; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888 [download] Update: I was curious how difficult it would be to process the lines in a single pass, and found the following solution. This produces the lines in the clusters in the same order they appeared in the input, which is different from the order you listed them in your original post. If I don't sort to keep the original order, they come out in yet another order. Update2: Fixed the sort so that the lines truly come out in the order they appeared in the input. #!/usr/bin/perl use strict; use warnings; my @clusters; chomp (my @lines = <DATA>); foreach my $line (0..$#lines) { my @ids = $lines[$line] =~ m/\.(\S+)/g; my $in_cluster; foreach my $cluster (0..$#clusters) { if ( map { my $id = $_; grep { $id eq $_ } @ids } @{$clusters[ +$cluster]->{ids}} ) { if(defined($in_cluster)) { push(@{$clusters[$in_cluster]->{lines}}, @{$clusters[$ +cluster]->{lines}}); push(@{$clusters[$in_cluster]->{ids}}, @{$clusters[$cl +uster]->{ids}}); splice(@clusters,$cluster,1); last; } else { $in_cluster = $cluster; push(@{$clusters[$cluster]->{lines}}, $line); push(@{$clusters[$cluster]->{ids}}, @ids); } } } unless(defined($in_cluster)) { my $cluster = {}; push(@{$cluster->{lines}}, $line); push(@{$cluster->{ids}}, @ids); push(@clusters, $cluster); } } foreach my $cluster (0..$#clusters) { print "Cluster number " . ($cluster + 1) . ":\n"; foreach my $line (sort { $a <=> $b } @{$clusters[$cluster]->{lines +}}) { print "\t$lines[$line]\n"; } print "\n"; } exit(0); __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888 [download]	[reply] [d/l] [select]
Re^2: clustering pairs by sugar (Beadle) on Dec 01, 2008 at 07:07 UTC
thank u :) i have already started learning hashes in perl now :)	[reply]
Re^2: clustering pairs by sugar (Beadle) on Dec 03, 2008 at 01:58 UTC
i tried fixing it using arrays :) #!/usr/bin/perl open FILE,"sampledata"; @arr = <FILE>; chomp @arr; close(FILE); local $,="\n"; while(@arr) { my @reslt; @str = shift @arr; push(@reslt,$str[0]); while (@str) { $flag = 0; $str = shift @str; $s1,$s2,$flag) = split(/ /,$str); my $count = -1; my $acount = 0; #to arrange o/p foreach(@arr) { $count++; if($_ =~ /$s1\|$s2/) { $acount++; if($acount == 2 \|\| $flag == 1) { unshift(@reslt,$_); unshift(@str,$_." 1"); } else { push(@reslt,$_); push(@str,$_); } splice(@arr,$count,1); } } } print @reslt,"\n"; } [download]	[reply] [d/l]
Re^3: clustering pairs by ig (Vicar) on Dec 05, 2008 at 04:36 UTC
You have almost arrived at yet another solution to your problem. You have made good progress from your initial posting. I agree with GrandFather's comments. Also, note the warning regarding modification of the list a foreach loop is iterating over, in perlsyn If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don�t do that. Your "ID" is not compatible with the clusters you gave in your original post, which will only be produced if you ignore the characters preceding and including the period in each half of the string. These issues are addressed somewhat in the following: Read more... (2 kB)	[reply] [d/l]
Re: clustering pairs by hangon (Deacon) on Dec 01, 2008 at 09:31 UTC
This one deconstructs the data array as it finds matches. When the current cluster runs out of matches (inner loop) it starts over with a new cluster (outer loop) until the array is empty. use strict; use warnings; my @data = <DATA>; chomp @data; my $cluster = 1; while (@data){ my $pair = shift @data; print "\ncluster$cluster\n$pair\n"; my (undef, $id1, undef, $id2) = split /[. ]/, $pair; my $i = 0; while ($i < @data){ if ($data[$i] =~ /$id1/ \|\| $data[$i] =~ /$id2/){ $pair = splice @data, $i, 1; print "$pair\n"; (undef, $id1, undef, $id2) = split /[. ]/, $pair; $i = 0; }else{ $i++; } } $cluster++; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888 [download] Update: Although this works for the given sample query, it will only find 0 or 1 match for any id pair. I was thinking of linked lists, such as chains of sectors in a filesystem.	[reply] [d/l]
Re: clustering pairs by jdporter (Paladin) on Dec 01, 2008 at 17:10 UTC
clustering is best left to a module, imho. use Graph; use strict; use warnings; my %v; while (<DATA>) { for my $id ( /(\.C\S+)/g ) { push @{ $v{$id} }, $_; } } my $g = new Graph::Undirected; for my $id ( keys %v ) { my @v = @{ $v{$id} }; $g->add_vertex( $v[0] ) if @v == 1; while ( @v > 1 ) { $g->add_edge( $v[0], $v[1] ); shift @v; } } my @clusters = $g->connected_components; for my $i ( 0 .. $#clusters ) { print "\ncluster".($i+1)."\n"; print for @{$clusters[$i]}; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888 [download] Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.	[reply] [d/l]