Re: clustering pairs
by GrandFather (Saint) on Dec 01, 2008 at 04:04 UTC
|
This is a somewhat tricky problem and the code below may require some careful study to understand! You are quite right that hashes are the key, but even with that some thought is required. Note that the following code tries to keep the clusters in order of the first cluster entry found which complicates things a little more, but not much.
use strict;
use warnings;
my %idLookup;
my %firstLines;
while (<DATA>) {
chomp;
next unless length;
my $pair = $_;
my ($first, $second) = map {s/.*\.//; $_} split ' ';
push @{$idLookup{$first}}, [$second, $pair];
push @{$idLookup{$second}}, [$first, $pair];
$firstLines{$first} = $. unless exists $firstLines{$first};
$firstLines{$second} = $. + 0.5 unless exists $firstLines{$second}
+;
}
my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first
+Lines;
my $clusterNum;
for my $id (@lineOrder) {
next unless exists $firstLines{$id};
my %hits = ($id => undef);
my @cluster;
while (keys %hits) {
my $hit = (keys %hits)[0];
my $list = $idLookup{$hit};
delete $hits{$hit};
next unless exists $firstLines{$hit};
delete $firstLines{$hit};
for my $entry (@$list) {
my ($entryId, $pair) = @$entry;
next unless exists $firstLines{$entryId};
push @cluster, $pair;
$hits{$entryId} = 1;
}
}
print "cluster", ++$clusterNum, "\n";
print join "\n", @cluster, "\n";
}
__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
Prints:
cluster1
ID5141.C1665 ID5141.C2448
cluster2
ID5141.C1253 ID5144.C2039
ID5141.C2039 ID5142.C1221
ID5141.C1221 ID5144.C1956
ID5141.C1596 ID5144.C1956
ID5141.C1046 ID5141.C1596
cluster3
ID5141.C1906 ID5144.C2149
ID5141.C2149 ID5141.C2386
ID5141.C2386 ID5141.C4990
cluster4
ID5141.C5887 ID5141.C7685
ID5141.C7685 ID5141.C4888
cluster5
ID5141.C1005 ID5142.C2808
Perl's payment curve coincides with its learning curve.
| [reply] [d/l] [select] |
|
|
genius :) still i am trying to understand your program. Its very complicated.if you dont mind can u include comments on how it works. so that, i can learn, understand and also be able to modify the program to add features if needed later. Thank you so much for your guidance. hashes can do wonders i guess. i should start learning them soon :)
| [reply] |
|
|
# Always use strictures!
use strict;
use warnings;
# Use nice names for variables. There is no such thing as saving time
# using short variable names! However, don't use excessively long
# names either. Naming things is an important part of programming!
my %idLookup;
my %firstLines;
while (<DATA>) {
chomp; # not chop! Remove line end from $_
next unless length; # Skip blank lines
my $pair = $_;
# Pull out the id strings
my ($first, $second) = map {s/.*\.//; $_} split ' ';
# idLookup is a hash of arrays. Push one id onto the array
# accessed using the other id as the hash key
push @{$idLookup{$first}}, [$second, $pair];
push @{$idLookup{$second}}, [$first, $pair];
# Build a hash keyed by id containing line numbers ($. is current
+line number
$firstLines{$first} = $. unless exists $firstLines{$first};
$firstLines{$second} = $. + 0.5 unless exists $firstLines{$second}
+;
}
# Build a list of ids ordered by line number
my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first
+Lines;
my $clusterNum;
# Generate clusters in order of first occurrence of ids
for my $id (@lineOrder) {
# next if already allocated id to a cluster
next unless exists $firstLines{$id};
my %hits = ($id => undef); # hash of ids in current cluster to pro
+cess
my @cluster;
while (keys %hits) {
my $hit = (keys %hits)[0]; # Get an id to process
my $list = $idLookup{$hit}; # Get list of other ids current on
+e is matched with
delete $hits{$hit}; # Remove current id from the hit list
# next if already processed the id
next unless exists $firstLines{$hit};
delete $firstLines{$hit}; # Remove current id to indicate it's
+ processed
for my $entry (@$list) {
# Add matched ids to cluster
my ($entryId, $pair) = @$entry;
# next (don't add) if processed already
next unless exists $firstLines{$entryId};
push @cluster, $pair; # Add the original line to the clust
+er
$hits{$entryId} = 1; # Add the matched id as a hit
}
}
# print the cluster
print "cluster", ++$clusterNum, "\n"; # Header line
print join "\n", @cluster, "\n"; # Original lines
}
Perl's payment curve coincides with its learning curve.
| [reply] [d/l] |
|
|
|
|
|
|
|
ID5144.C2039
ID5141.C2039
ID5142.C1221
my ($first, $second) = map {s/.*\.//; $_} split ' ';
Are you deleting the first half of the ID?
While the first column contains only "ID5141"'s the second column does contain unique prefixes that are probably important to the problem.
ID5144.C2039 is different from ID5141.C2039, for example. | [reply] [d/l] [select] |
|
|
Looking at what the OP considers to be valid clusters, it appears that only the second part of each ID (C\d+) is considered when determining whether two items are "equal"; the first part (ID\d+) is ignored. Of course, the entire item must be remembered for when it is output again.
| [reply] [d/l] [select] |
Re: clustering pairs
by ig (Vicar) on Dec 01, 2008 at 05:25 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
my %lookup_ids_by_line;
my %lookup_lines_by_id;
chomp (my @lines = <DATA>);
foreach my $line (@lines) {
my ($id1, $id2) = $line =~ m/\.(\S+)/g;
push(@{$lookup_ids_by_line{$line}}, $id1, $id2);
push(@{$lookup_lines_by_id{$id1}}, $line);
push(@{$lookup_lines_by_id{$id2}}, $line);
}
my $cluster_number = 0;
foreach my $line (@lines) {
next unless(exists($lookup_ids_by_line{$line}));
$cluster_number++;
my @lines_in_cluster = ($line);
my @ids_to_check = @{$lookup_ids_by_line{$line}};
delete($lookup_ids_by_line{$line});
while(@ids_to_check) {
my $id = pop(@ids_to_check);
next unless(exists($lookup_lines_by_id{$id}));
foreach my $line (@{$lookup_lines_by_id{$id}}) {
if(exists($lookup_ids_by_line{$line})) {
push(@lines_in_cluster, $line);
push(@ids_to_check, @{$lookup_ids_by_line{$line}});
delete($lookup_ids_by_line{$line});
}
}
delete($lookup_lines_by_id{$id});
}
print "cluster number $cluster_number:\n\t" .
join("\n\t",@lines_in_cluster) . "\n";
}
__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
Update: I was curious how difficult it would be to process the lines in a single pass, and found the following solution. This produces the lines in the clusters in the same order they appeared in the input, which is different from the order you listed them in your original post. If I don't sort to keep the original order, they come out in yet another order.
Update2: Fixed the sort so that the lines truly come out in the order they appeared in the input.
#!/usr/bin/perl
use strict;
use warnings;
my @clusters;
chomp (my @lines = <DATA>);
foreach my $line (0..$#lines) {
my @ids = $lines[$line] =~ m/\.(\S+)/g;
my $in_cluster;
foreach my $cluster (0..$#clusters) {
if ( map { my $id = $_; grep { $id eq $_ } @ids } @{$clusters[
+$cluster]->{ids}} ) {
if(defined($in_cluster)) {
push(@{$clusters[$in_cluster]->{lines}}, @{$clusters[$
+cluster]->{lines}});
push(@{$clusters[$in_cluster]->{ids}}, @{$clusters[$cl
+uster]->{ids}});
splice(@clusters,$cluster,1);
last;
} else {
$in_cluster = $cluster;
push(@{$clusters[$cluster]->{lines}}, $line);
push(@{$clusters[$cluster]->{ids}}, @ids);
}
}
}
unless(defined($in_cluster)) {
my $cluster = {};
push(@{$cluster->{lines}}, $line);
push(@{$cluster->{ids}}, @ids);
push(@clusters, $cluster);
}
}
foreach my $cluster (0..$#clusters) {
print "Cluster number " . ($cluster + 1) . ":\n";
foreach my $line (sort { $a <=> $b } @{$clusters[$cluster]->{lines
+}}) {
print "\t$lines[$line]\n";
}
print "\n";
}
exit(0);
__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
| [reply] [d/l] [select] |
|
|
thank u :) i have already started learning hashes in perl now :)
| [reply] |
|
|
i tried fixing it using arrays :)
#!/usr/bin/perl
open FILE,"sampledata";
@arr = <FILE>;
chomp @arr;
close(FILE);
local $,="\n";
while(@arr) {
my @reslt;
@str = shift @arr;
push(@reslt,$str[0]);
while (@str) {
$flag = 0;
$str = shift @str;
$s1,$s2,$flag) = split(/ /,$str);
my $count = -1;
my $acount = 0; #to arrange o/p
foreach(@arr) {
$count++;
if($_ =~ /$s1|$s2/) {
$acount++;
if($acount == 2 || $flag == 1) {
unshift(@reslt,$_);
unshift(@str,$_." 1");
}
else {
push(@reslt,$_);
push(@str,$_);
}
splice(@arr,$count,1);
}
}
}
print @reslt,"\n";
}
| [reply] [d/l] |
|
|
You have almost arrived at yet another solution to your problem. You have made good progress from your initial posting.
I agree with GrandFather's comments.
Also, note the warning regarding modification of the list a foreach loop is iterating over, in perlsyn
If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don’t do that.
Your "ID" is not compatible with the clusters you gave in your original post, which will only be produced if you ignore the characters preceding and including the period in each half of the string.
These issues are addressed somewhat in the following:
| [reply] [d/l] |
Re: clustering pairs
by hangon (Deacon) on Dec 01, 2008 at 09:31 UTC
|
This one deconstructs the data array as it finds matches. When the current cluster runs out of matches (inner loop) it starts over with a new cluster (outer loop) until the array is empty.
use strict;
use warnings;
my @data = <DATA>;
chomp @data;
my $cluster = 1;
while (@data){
my $pair = shift @data;
print "\ncluster$cluster\n$pair\n";
my (undef, $id1, undef, $id2) = split /[. ]/, $pair;
my $i = 0;
while ($i < @data){
if ($data[$i] =~ /$id1/ || $data[$i] =~ /$id2/){
$pair = splice @data, $i, 1;
print "$pair\n";
(undef, $id1, undef, $id2) = split /[. ]/, $pair;
$i = 0;
}else{
$i++;
}
}
$cluster++;
}
__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
Update: Although this works for the given sample query, it will only find 0 or 1 match for any id pair. I was thinking of linked lists, such as chains of sectors in a filesystem.
| [reply] [d/l] |
Re: clustering pairs
by jdporter (Paladin) on Dec 01, 2008 at 17:10 UTC
|
clustering is best left to a module, imho.
use Graph;
use strict;
use warnings;
my %v;
while (<DATA>)
{
for my $id ( /(\.C\S+)/g )
{
push @{ $v{$id} }, $_;
}
}
my $g = new Graph::Undirected;
for my $id ( keys %v )
{
my @v = @{ $v{$id} };
$g->add_vertex( $v[0] ) if @v == 1;
while ( @v > 1 )
{
$g->add_edge( $v[0], $v[1] );
shift @v;
}
}
my @clusters = $g->connected_components;
for my $i ( 0 .. $#clusters )
{
print "\ncluster".($i+1)."\n";
print for @{$clusters[$i]};
}
__DATA__
ID5141.C1665 ID5141.C2448
ID5141.C1253 ID5144.C2039
ID5141.C1596 ID5144.C1956
ID5141.C1906 ID5144.C2149
ID5141.C1221 ID5144.C1956
ID5141.C2149 ID5141.C2386
ID5141.C2039 ID5142.C1221
ID5141.C5887 ID5141.C7685
ID5141.C1005 ID5142.C2808
ID5141.C1046 ID5141.C1596
ID5141.C2386 ID5141.C4990
ID5141.C7685 ID5141.C4888
Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
| [reply] [d/l] |