Re: clustering pairs

I found a somewhat different solution...

#!/usr/bin/perl
use strict;
use warnings;

my %lookup_ids_by_line;
my %lookup_lines_by_id;

chomp (my @lines = <DATA>);

foreach my $line (@lines) {

    my ($id1, $id2) = $line =~ m/\.(\S+)/g;

    push(@{$lookup_ids_by_line{$line}}, $id1, $id2);
    push(@{$lookup_lines_by_id{$id1}}, $line);
    push(@{$lookup_lines_by_id{$id2}}, $line);
}

my $cluster_number = 0;

foreach my $line (@lines) {
    next unless(exists($lookup_ids_by_line{$line}));

    $cluster_number++;

    my @lines_in_cluster = ($line);
    my @ids_to_check = @{$lookup_ids_by_line{$line}};

    delete($lookup_ids_by_line{$line});

    while(@ids_to_check) {
        my $id = pop(@ids_to_check);
        next unless(exists($lookup_lines_by_id{$id}));

        foreach my $line (@{$lookup_lines_by_id{$id}}) {

            if(exists($lookup_ids_by_line{$line})) {
                push(@lines_in_cluster, $line);
                push(@ids_to_check, @{$lookup_ids_by_line{$line}});

                delete($lookup_ids_by_line{$line});
            }
        }

        delete($lookup_lines_by_id{$id});
    }
    print "cluster number $cluster_number:\n\t" .
        join("\n\t",@lines_in_cluster) . "\n";
}

__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
[download]

Update: I was curious how difficult it would be to process the lines in a single pass, and found the following solution. This produces the lines in the clusters in the same order they appeared in the input, which is different from the order you listed them in your original post. If I don't sort to keep the original order, they come out in yet another order.

Update2: Fixed the sort so that the lines truly come out in the order they appeared in the input.

#!/usr/bin/perl
use strict;
use warnings;

my @clusters;

chomp (my @lines = <DATA>);


foreach my $line (0..$#lines) {

    my @ids = $lines[$line] =~ m/\.(\S+)/g;

    my $in_cluster;

    foreach my $cluster (0..$#clusters) {
        if ( map { my $id = $_; grep { $id eq $_ } @ids } @{$clusters[
+$cluster]->{ids}} ) {
            if(defined($in_cluster)) {
                push(@{$clusters[$in_cluster]->{lines}}, @{$clusters[$
+cluster]->{lines}});
                push(@{$clusters[$in_cluster]->{ids}}, @{$clusters[$cl
+uster]->{ids}});
                splice(@clusters,$cluster,1);
                last;
            } else {
                $in_cluster = $cluster;
                push(@{$clusters[$cluster]->{lines}}, $line);
                push(@{$clusters[$cluster]->{ids}}, @ids);
            }
        }
    }
    unless(defined($in_cluster)) {
        my $cluster = {};
        push(@{$cluster->{lines}}, $line);
        push(@{$cluster->{ids}}, @ids);
        push(@clusters, $cluster);
    }
}


foreach my $cluster (0..$#clusters) {
    print "Cluster number " . ($cluster + 1) . ":\n";
    foreach my $line (sort { $a <=> $b } @{$clusters[$cluster]->{lines
+}}) {
        print "\t$lines[$line]\n";
    }
    print "\n";
}

exit(0);

__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
[download]

Comment on Re: clustering pairs Select or Download Code

Replies are listed 'Best First'.
Re^2: clustering pairs by sugar (Beadle) on Dec 01, 2008 at 07:07 UTC
thank u :) i have already started learning hashes in perl now :)	[reply]
Re^2: clustering pairs by sugar (Beadle) on Dec 03, 2008 at 01:58 UTC
i tried fixing it using arrays :) #!/usr/bin/perl open FILE,"sampledata"; @arr = <FILE>; chomp @arr; close(FILE); local $,="\n"; while(@arr) { my @reslt; @str = shift @arr; push(@reslt,$str[0]); while (@str) { $flag = 0; $str = shift @str; $s1,$s2,$flag) = split(/ /,$str); my $count = -1; my $acount = 0; #to arrange o/p foreach(@arr) { $count++; if($_ =~ /$s1\|$s2/) { $acount++; if($acount == 2 \|\| $flag == 1) { unshift(@reslt,$_); unshift(@str,$_." 1"); } else { push(@reslt,$_); push(@str,$_); } splice(@arr,$count,1); } } } print @reslt,"\n"; } [download]	[reply] [d/l]
Re^3: clustering pairs by ig (Vicar) on Dec 05, 2008 at 04:36 UTC
You have almost arrived at yet another solution to your problem. You have made good progress from your initial posting. I agree with GrandFather's comments. Also, note the warning regarding modification of the list a foreach loop is iterating over, in perlsyn If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don’t do that. Your "ID" is not compatible with the clusters you gave in your original post, which will only be produced if you ignore the characters preceding and including the period in each half of the string. These issues are addressed somewhat in the following: Read more... (2 kB)	[reply] [d/l]