in reply to clustering pairs

This is a somewhat tricky problem and the code below may require some careful study to understand! You are quite right that hashes are the key, but even with that some thought is required. Note that the following code tries to keep the clusters in order of the first cluster entry found which complicates things a little more, but not much.

use strict; use warnings; my %idLookup; my %firstLines; while (<DATA>) { chomp; next unless length; my $pair = $_; my ($first, $second) = map {s/.*\.//; $_} split ' '; push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; for my $id (@lineOrder) { next unless exists $firstLines{$id}; my %hits = ($id => undef); my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; my $list = $idLookup{$hit}; delete $hits{$hit}; next unless exists $firstLines{$hit}; delete $firstLines{$hit}; for my $entry (@$list) { my ($entryId, $pair) = @$entry; next unless exists $firstLines{$entryId}; push @cluster, $pair; $hits{$entryId} = 1; } } print "cluster", ++$clusterNum, "\n"; print join "\n", @cluster, "\n"; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888

Prints:

cluster1 ID5141.C1665 ID5141.C2448 cluster2 ID5141.C1253 ID5144.C2039 ID5141.C2039 ID5142.C1221 ID5141.C1221 ID5144.C1956 ID5141.C1596 ID5144.C1956 ID5141.C1046 ID5141.C1596 cluster3 ID5141.C1906 ID5144.C2149 ID5141.C2149 ID5141.C2386 ID5141.C2386 ID5141.C4990 cluster4 ID5141.C5887 ID5141.C7685 ID5141.C7685 ID5141.C4888 cluster5 ID5141.C1005 ID5142.C2808

Perl's payment curve coincides with its learning curve.

Replies are listed 'Best First'.
Re^2: clustering pairs
by sugar (Beadle) on Dec 01, 2008 at 05:09 UTC
    genius :) still i am trying to understand your program. Its very complicated.if you dont mind can u include comments on how it works. so that, i can learn, understand and also be able to modify the program to add features if needed later. Thank you so much for your guidance. hashes can do wonders i guess. i should start learning them soon :)

      I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl.

      # Always use strictures! use strict; use warnings; # Use nice names for variables. There is no such thing as saving time # using short variable names! However, don't use excessively long # names either. Naming things is an important part of programming! my %idLookup; my %firstLines; while (<DATA>) { chomp; # not chop! Remove line end from $_ next unless length; # Skip blank lines my $pair = $_; # Pull out the id strings my ($first, $second) = map {s/.*\.//; $_} split ' '; # idLookup is a hash of arrays. Push one id onto the array # accessed using the other id as the hash key push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; # Build a hash keyed by id containing line numbers ($. is current +line number $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } # Build a list of ids ordered by line number my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; # Generate clusters in order of first occurrence of ids for my $id (@lineOrder) { # next if already allocated id to a cluster next unless exists $firstLines{$id}; my %hits = ($id => undef); # hash of ids in current cluster to pro +cess my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; # Get an id to process my $list = $idLookup{$hit}; # Get list of other ids current on +e is matched with delete $hits{$hit}; # Remove current id from the hit list # next if already processed the id next unless exists $firstLines{$hit}; delete $firstLines{$hit}; # Remove current id to indicate it's + processed for my $entry (@$list) { # Add matched ids to cluster my ($entryId, $pair) = @$entry; # next (don't add) if processed already next unless exists $firstLines{$entryId}; push @cluster, $pair; # Add the original line to the clust +er $hits{$entryId} = 1; # Add the matched id as a hit } } # print the cluster print "cluster", ++$clusterNum, "\n"; # Header line print join "\n", @cluster, "\n"; # Original lines }

      Perl's payment curve coincides with its learning curve.
        Thanks a lot, that was of a great help with the comments explaining in detail...Just thought, y not possible with arrays. Check out the program below :)
        #!/usr/bin/perl open FILE,"sampledata"; @arr = <FILE>; chomp @arr; close(FILE); local $,="\n"; while(@arr) { my @reslt; @str = shift @arr; push(@reslt,$str[0]); while (@str) { $flag = 0; $str = shift @str; $s1,$s2,$flag) = split(/ /,$str); my $count = -1; my $acount = 0; #to arrange o/p foreach(@arr) { $count++; if($_ =~ /$s1|$s2/) { $acount++; if($acount == 2 || $flag == 1) { unshift(@reslt,$_); unshift(@str,$_." 1"); } else { push(@reslt,$_); push(@str,$_); } splice(@arr,$count,1); } } } print @reslt,"\n"; }
Re^2: clustering pairs
by SuicideJunkie (Vicar) on Dec 01, 2008 at 16:46 UTC
    ID5144.C2039 ID5141.C2039 ID5142.C1221 my ($first, $second) = map {s/.*\.//; $_} split ' ';

    Are you deleting the first half of the ID?
    While the first column contains only "ID5141"'s the second column does contain unique prefixes that are probably important to the problem.

    ID5144.C2039 is different from ID5141.C2039, for example.

      Looking at what the OP considers to be valid clusters, it appears that only the second part of each ID (C\d+) is considered when determining whether two items are "equal"; the first part (ID\d+) is ignored. Of course, the entire item must be remembered for when it is output again.