Re: clustering pairs

This is a somewhat tricky problem and the code below may require some careful study to understand! You are quite right that hashes are the key, but even with that some thought is required. Note that the following code tries to keep the clusters in order of the first cluster entry found which complicates things a little more, but not much.

use strict;
use warnings;

my %idLookup;
my %firstLines;

while (<DATA>) {
    chomp;
    next unless length;

    my $pair = $_;
    my ($first, $second) = map {s/.*\.//; $_} split ' ';

    push @{$idLookup{$first}}, [$second, $pair];
    push @{$idLookup{$second}}, [$first, $pair];
    $firstLines{$first} = $. unless exists $firstLines{$first};
    $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second}
+;
}

my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first
+Lines;
my $clusterNum;

for my $id (@lineOrder) {
    next unless exists $firstLines{$id};

    my %hits = ($id => undef);
    my @cluster;

    while (keys %hits) {
        my $hit = (keys %hits)[0];
        my $list = $idLookup{$hit};

        delete $hits{$hit};

        next unless exists $firstLines{$hit};
        delete $firstLines{$hit};

        for my $entry (@$list) {
            my ($entryId, $pair) = @$entry;

            next unless exists $firstLines{$entryId};
            push @cluster, $pair;
            $hits{$entryId} = 1;
        }
    }

    print "cluster", ++$clusterNum, "\n";
    print join "\n", @cluster, "\n";
}


__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
[download]

Prints:

cluster1
ID5141.C1665 ID5141.C2448

cluster2
ID5141.C1253 ID5144.C2039
ID5141.C2039 ID5142.C1221
ID5141.C1221 ID5144.C1956
ID5141.C1596 ID5144.C1956
ID5141.C1046 ID5141.C1596

cluster3
ID5141.C1906 ID5144.C2149
ID5141.C2149 ID5141.C2386
ID5141.C2386 ID5141.C4990

cluster4
ID5141.C5887 ID5141.C7685
ID5141.C7685 ID5141.C4888

cluster5
ID5141.C1005 ID5142.C2808
[download]

Perl's payment curve coincides with its learning curve.

Comment on Re: clustering pairs Select or Download Code

Replies are listed 'Best First'.
Re^2: clustering pairs by sugar (Beadle) on Dec 01, 2008 at 05:09 UTC
genius :) still i am trying to understand your program. Its very complicated.if you dont mind can u include comments on how it works. so that, i can learn, understand and also be able to modify the program to add features if needed later. Thank you so much for your guidance. hashes can do wonders i guess. i should start learning them soon :)	[reply]
Re^3: clustering pairs by GrandFather (Saint) on Dec 01, 2008 at 08:38 UTC
I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl. # Always use strictures! use strict; use warnings; # Use nice names for variables. There is no such thing as saving time # using short variable names! However, don't use excessively long # names either. Naming things is an important part of programming! my %idLookup; my %firstLines; while (<DATA>) { chomp; # not chop! Remove line end from $_ next unless length; # Skip blank lines my $pair = $_; # Pull out the id strings my ($first, $second) = map {s/.*\.//; $_} split ' '; # idLookup is a hash of arrays. Push one id onto the array # accessed using the other id as the hash key push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; # Build a hash keyed by id containing line numbers ($. is current +line number $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } # Build a list of ids ordered by line number my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; # Generate clusters in order of first occurrence of ids for my $id (@lineOrder) { # next if already allocated id to a cluster next unless exists $firstLines{$id}; my %hits = ($id => undef); # hash of ids in current cluster to pro +cess my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; # Get an id to process my $list = $idLookup{$hit}; # Get list of other ids current on +e is matched with delete $hits{$hit}; # Remove current id from the hit list # next if already processed the id next unless exists $firstLines{$hit}; delete $firstLines{$hit}; # Remove current id to indicate it's + processed for my $entry (@$list) { # Add matched ids to cluster my ($entryId, $pair) = @$entry; # next (don't add) if processed already next unless exists $firstLines{$entryId}; push @cluster, $pair; # Add the original line to the clust +er $hits{$entryId} = 1; # Add the matched id as a hit } } # print the cluster print "cluster", ++$clusterNum, "\n"; # Header line print join "\n", @cluster, "\n"; # Original lines } [download] Perl's payment curve coincides with its learning curve.	[reply] [d/l]
Re^4: clustering pairs by sugar (Beadle) on Dec 03, 2008 at 01:55 UTC
Thanks a lot, that was of a great help with the comments explaining in detail...Just thought, y not possible with arrays. Check out the program below :) #!/usr/bin/perl open FILE,"sampledata"; @arr = <FILE>; chomp @arr; close(FILE); local $,="\n"; while(@arr) { my @reslt; @str = shift @arr; push(@reslt,$str[0]); while (@str) { $flag = 0; $str = shift @str; $s1,$s2,$flag) = split(/ /,$str); my $count = -1; my $acount = 0; #to arrange o/p foreach(@arr) { $count++; if($_ =~ /$s1\|$s2/) { $acount++; if($acount == 2 \|\| $flag == 1) { unshift(@reslt,$_); unshift(@str,$_." 1"); } else { push(@reslt,$_); push(@str,$_); } splice(@arr,$count,1); } } } print @reslt,"\n"; } [download]	[reply] [d/l]
Re^5: clustering pairs by GrandFather (Saint) on Dec 03, 2008 at 02:27 UTC
Re^6: clustering pairs by sugar (Beadle) on Dec 03, 2008 at 02:56 UTC
Re^2: clustering pairs by SuicideJunkie (Vicar) on Dec 01, 2008 at 16:46 UTC
`ID5144.C2039 ID5141.C2039 ID5142.C1221 my ($first, $second) = map {s/.*\.//; $_} split ' ';` [download] Are you deleting the first half of the ID? While the first column contains only "ID5141"'s the second column does contain unique prefixes that are probably important to the problem. `ID5144.C2039` is different from `ID5141.C2039`, for example.	[reply] [d/l] [select]
Re^3: clustering pairs by jdporter (Paladin) on Dec 01, 2008 at 19:44 UTC
Looking at what the OP considers to be valid clusters, it appears that only the second part of each ID (`C\d+`) is considered when determining whether two items are "equal"; the first part (`ID\d+`) is ignored. Of course, the entire item must be remembered for when it is output again.	[reply] [d/l] [select]