in reply to clustering pairs
This is a somewhat tricky problem and the code below may require some careful study to understand! You are quite right that hashes are the key, but even with that some thought is required. Note that the following code tries to keep the clusters in order of the first cluster entry found which complicates things a little more, but not much.
use strict; use warnings; my %idLookup; my %firstLines; while (<DATA>) { chomp; next unless length; my $pair = $_; my ($first, $second) = map {s/.*\.//; $_} split ' '; push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; for my $id (@lineOrder) { next unless exists $firstLines{$id}; my %hits = ($id => undef); my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; my $list = $idLookup{$hit}; delete $hits{$hit}; next unless exists $firstLines{$hit}; delete $firstLines{$hit}; for my $entry (@$list) { my ($entryId, $pair) = @$entry; next unless exists $firstLines{$entryId}; push @cluster, $pair; $hits{$entryId} = 1; } } print "cluster", ++$clusterNum, "\n"; print join "\n", @cluster, "\n"; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888
Prints:
cluster1 ID5141.C1665 ID5141.C2448 cluster2 ID5141.C1253 ID5144.C2039 ID5141.C2039 ID5142.C1221 ID5141.C1221 ID5144.C1956 ID5141.C1596 ID5144.C1956 ID5141.C1046 ID5141.C1596 cluster3 ID5141.C1906 ID5144.C2149 ID5141.C2149 ID5141.C2386 ID5141.C2386 ID5141.C4990 cluster4 ID5141.C5887 ID5141.C7685 ID5141.C7685 ID5141.C4888 cluster5 ID5141.C1005 ID5142.C2808
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: clustering pairs
by sugar (Beadle) on Dec 01, 2008 at 05:09 UTC | |
by GrandFather (Saint) on Dec 01, 2008 at 08:38 UTC | |
by sugar (Beadle) on Dec 03, 2008 at 01:55 UTC | |
by GrandFather (Saint) on Dec 03, 2008 at 02:27 UTC | |
by sugar (Beadle) on Dec 03, 2008 at 02:56 UTC | |
|
Re^2: clustering pairs
by SuicideJunkie (Vicar) on Dec 01, 2008 at 16:46 UTC | |
by jdporter (Paladin) on Dec 01, 2008 at 19:44 UTC |