sugar has asked for the wisdom of the Perl Monks concerning the following question:

SAMPLE QUERY: ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888 ---------------------------- SAMPLE RESULT: cluster1 ID5141.C1665 ID5141.C2448 cluster2 ID5141.C1253 ID5144.C2039 ID5141.C2039 ID5142.C1221 ID5141.C1221 ID5144.C1956 ID5141.C1596 ID5144.C1956 ID5141.C1046 ID5141.C1596 cluster3 ID5141.C1906 ID5144.C2149 ID5141.C2149 ID5141.C2386 ID5141.C2386 ID5141.C4990 cluster4 ID5141.C1005 ID5141.C2808 cluster5 ID5141.C5887 ID5141.C7685 ID5141.C7685 ID5141.C4888 ----------------------------
Problem: Each line has a pair of ID's. One of the Id's in every pair may or may not match with other ID's in other lines. If the ID does not match with any other ID's of other lines, it is treated as a separate cluster (example - cluster1 and cluster4 in result file) because the ID's in these clusters are not available else where in the file.But, If one of the ID's from one line matches with any other ID (a 'MATCH') in another line, the clustering starts. A similar ID for the 'MATCH' is looked for in the other lines and continues to find 'MATCH'es until it stops without a 'MATCH' (example - cluster2 , cluster3, and cluster5 in result file). I have been trying to find a way to do this, but in vain. I am a beginner in perl, n i know only to deal with arrays to some extent. Do i need to use hash for this problem? will it help me to solve the problem? Please guide me with the shortest or the simplest way to deal with this problem. The program which i wrote n got stuck in the middle is this. Dont know how to proceed actually :(
#!/usr/bin/perl open(FH,"sample_query") or die "can not open"; @array=<FH>; for($i=0;$i<$#array;$i++) { if($array[$i]=~/$array[$i]/) { $mark=$array[$i]; if($mark=~/$array[$i]/) { print "$array[$i]\n"; } } }
Please guide how to proceed!!!

Replies are listed 'Best First'.
Re: clustering pairs
by GrandFather (Saint) on Dec 01, 2008 at 04:04 UTC

    This is a somewhat tricky problem and the code below may require some careful study to understand! You are quite right that hashes are the key, but even with that some thought is required. Note that the following code tries to keep the clusters in order of the first cluster entry found which complicates things a little more, but not much.

    use strict; use warnings; my %idLookup; my %firstLines; while (<DATA>) { chomp; next unless length; my $pair = $_; my ($first, $second) = map {s/.*\.//; $_} split ' '; push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; for my $id (@lineOrder) { next unless exists $firstLines{$id}; my %hits = ($id => undef); my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; my $list = $idLookup{$hit}; delete $hits{$hit}; next unless exists $firstLines{$hit}; delete $firstLines{$hit}; for my $entry (@$list) { my ($entryId, $pair) = @$entry; next unless exists $firstLines{$entryId}; push @cluster, $pair; $hits{$entryId} = 1; } } print "cluster", ++$clusterNum, "\n"; print join "\n", @cluster, "\n"; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888

    Prints:

    cluster1 ID5141.C1665 ID5141.C2448 cluster2 ID5141.C1253 ID5144.C2039 ID5141.C2039 ID5142.C1221 ID5141.C1221 ID5144.C1956 ID5141.C1596 ID5144.C1956 ID5141.C1046 ID5141.C1596 cluster3 ID5141.C1906 ID5144.C2149 ID5141.C2149 ID5141.C2386 ID5141.C2386 ID5141.C4990 cluster4 ID5141.C5887 ID5141.C7685 ID5141.C7685 ID5141.C4888 cluster5 ID5141.C1005 ID5142.C2808

    Perl's payment curve coincides with its learning curve.
      genius :) still i am trying to understand your program. Its very complicated.if you dont mind can u include comments on how it works. so that, i can learn, understand and also be able to modify the program to add features if needed later. Thank you so much for your guidance. hashes can do wonders i guess. i should start learning them soon :)

        I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl.

        # Always use strictures! use strict; use warnings; # Use nice names for variables. There is no such thing as saving time # using short variable names! However, don't use excessively long # names either. Naming things is an important part of programming! my %idLookup; my %firstLines; while (<DATA>) { chomp; # not chop! Remove line end from $_ next unless length; # Skip blank lines my $pair = $_; # Pull out the id strings my ($first, $second) = map {s/.*\.//; $_} split ' '; # idLookup is a hash of arrays. Push one id onto the array # accessed using the other id as the hash key push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; # Build a hash keyed by id containing line numbers ($. is current +line number $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } # Build a list of ids ordered by line number my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; # Generate clusters in order of first occurrence of ids for my $id (@lineOrder) { # next if already allocated id to a cluster next unless exists $firstLines{$id}; my %hits = ($id => undef); # hash of ids in current cluster to pro +cess my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; # Get an id to process my $list = $idLookup{$hit}; # Get list of other ids current on +e is matched with delete $hits{$hit}; # Remove current id from the hit list # next if already processed the id next unless exists $firstLines{$hit}; delete $firstLines{$hit}; # Remove current id to indicate it's + processed for my $entry (@$list) { # Add matched ids to cluster my ($entryId, $pair) = @$entry; # next (don't add) if processed already next unless exists $firstLines{$entryId}; push @cluster, $pair; # Add the original line to the clust +er $hits{$entryId} = 1; # Add the matched id as a hit } } # print the cluster print "cluster", ++$clusterNum, "\n"; # Header line print join "\n", @cluster, "\n"; # Original lines }

        Perl's payment curve coincides with its learning curve.
      ID5144.C2039 ID5141.C2039 ID5142.C1221 my ($first, $second) = map {s/.*\.//; $_} split ' ';

      Are you deleting the first half of the ID?
      While the first column contains only "ID5141"'s the second column does contain unique prefixes that are probably important to the problem.

      ID5144.C2039 is different from ID5141.C2039, for example.

        Looking at what the OP considers to be valid clusters, it appears that only the second part of each ID (C\d+) is considered when determining whether two items are "equal"; the first part (ID\d+) is ignored. Of course, the entire item must be remembered for when it is output again.

Re: clustering pairs
by ig (Vicar) on Dec 01, 2008 at 05:25 UTC

    I found a somewhat different solution...

    #!/usr/bin/perl use strict; use warnings; my %lookup_ids_by_line; my %lookup_lines_by_id; chomp (my @lines = <DATA>); foreach my $line (@lines) { my ($id1, $id2) = $line =~ m/\.(\S+)/g; push(@{$lookup_ids_by_line{$line}}, $id1, $id2); push(@{$lookup_lines_by_id{$id1}}, $line); push(@{$lookup_lines_by_id{$id2}}, $line); } my $cluster_number = 0; foreach my $line (@lines) { next unless(exists($lookup_ids_by_line{$line})); $cluster_number++; my @lines_in_cluster = ($line); my @ids_to_check = @{$lookup_ids_by_line{$line}}; delete($lookup_ids_by_line{$line}); while(@ids_to_check) { my $id = pop(@ids_to_check); next unless(exists($lookup_lines_by_id{$id})); foreach my $line (@{$lookup_lines_by_id{$id}}) { if(exists($lookup_ids_by_line{$line})) { push(@lines_in_cluster, $line); push(@ids_to_check, @{$lookup_ids_by_line{$line}}); delete($lookup_ids_by_line{$line}); } } delete($lookup_lines_by_id{$id}); } print "cluster number $cluster_number:\n\t" . join("\n\t",@lines_in_cluster) . "\n"; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888

    Update: I was curious how difficult it would be to process the lines in a single pass, and found the following solution. This produces the lines in the clusters in the same order they appeared in the input, which is different from the order you listed them in your original post. If I don't sort to keep the original order, they come out in yet another order.

    Update2: Fixed the sort so that the lines truly come out in the order they appeared in the input.

    #!/usr/bin/perl use strict; use warnings; my @clusters; chomp (my @lines = <DATA>); foreach my $line (0..$#lines) { my @ids = $lines[$line] =~ m/\.(\S+)/g; my $in_cluster; foreach my $cluster (0..$#clusters) { if ( map { my $id = $_; grep { $id eq $_ } @ids } @{$clusters[ +$cluster]->{ids}} ) { if(defined($in_cluster)) { push(@{$clusters[$in_cluster]->{lines}}, @{$clusters[$ +cluster]->{lines}}); push(@{$clusters[$in_cluster]->{ids}}, @{$clusters[$cl +uster]->{ids}}); splice(@clusters,$cluster,1); last; } else { $in_cluster = $cluster; push(@{$clusters[$cluster]->{lines}}, $line); push(@{$clusters[$cluster]->{ids}}, @ids); } } } unless(defined($in_cluster)) { my $cluster = {}; push(@{$cluster->{lines}}, $line); push(@{$cluster->{ids}}, @ids); push(@clusters, $cluster); } } foreach my $cluster (0..$#clusters) { print "Cluster number " . ($cluster + 1) . ":\n"; foreach my $line (sort { $a <=> $b } @{$clusters[$cluster]->{lines +}}) { print "\t$lines[$line]\n"; } print "\n"; } exit(0); __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888
      thank u :) i have already started learning hashes in perl now :)
      i tried fixing it using arrays :)
      #!/usr/bin/perl open FILE,"sampledata"; @arr = <FILE>; chomp @arr; close(FILE); local $,="\n"; while(@arr) { my @reslt; @str = shift @arr; push(@reslt,$str[0]); while (@str) { $flag = 0; $str = shift @str; $s1,$s2,$flag) = split(/ /,$str); my $count = -1; my $acount = 0; #to arrange o/p foreach(@arr) { $count++; if($_ =~ /$s1|$s2/) { $acount++; if($acount == 2 || $flag == 1) { unshift(@reslt,$_); unshift(@str,$_." 1"); } else { push(@reslt,$_); push(@str,$_); } splice(@arr,$count,1); } } } print @reslt,"\n"; }

        You have almost arrived at yet another solution to your problem. You have made good progress from your initial posting.

        I agree with GrandFather's comments.

        Also, note the warning regarding modification of the list a foreach loop is iterating over, in perlsyn

        If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don’t do that.

        Your "ID" is not compatible with the clusters you gave in your original post, which will only be produced if you ignore the characters preceding and including the period in each half of the string.

        These issues are addressed somewhat in the following:

Re: clustering pairs
by hangon (Deacon) on Dec 01, 2008 at 09:31 UTC

    This one deconstructs the data array as it finds matches. When the current cluster runs out of matches (inner loop) it starts over with a new cluster (outer loop) until the array is empty.

    use strict; use warnings; my @data = <DATA>; chomp @data; my $cluster = 1; while (@data){ my $pair = shift @data; print "\ncluster$cluster\n$pair\n"; my (undef, $id1, undef, $id2) = split /[. ]/, $pair; my $i = 0; while ($i < @data){ if ($data[$i] =~ /$id1/ || $data[$i] =~ /$id2/){ $pair = splice @data, $i, 1; print "$pair\n"; (undef, $id1, undef, $id2) = split /[. ]/, $pair; $i = 0; }else{ $i++; } } $cluster++; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888

    Update: Although this works for the given sample query, it will only find 0 or 1 match for any id pair. I was thinking of linked lists, such as chains of sectors in a filesystem.

Re: clustering pairs
by jdporter (Paladin) on Dec 01, 2008 at 17:10 UTC

    clustering is best left to a module, imho.

    use Graph; use strict; use warnings; my %v; while (<DATA>) { for my $id ( /(\.C\S+)/g ) { push @{ $v{$id} }, $_; } } my $g = new Graph::Undirected; for my $id ( keys %v ) { my @v = @{ $v{$id} }; $g->add_vertex( $v[0] ) if @v == 1; while ( @v > 1 ) { $g->add_edge( $v[0], $v[1] ); shift @v; } } my @clusters = $g->connected_components; for my $i ( 0 .. $#clusters ) { print "\ncluster".($i+1)."\n"; print for @{$clusters[$i]}; } __DATA__ ID5141.C1665 ID5141.C2448 ID5141.C1253 ID5144.C2039 ID5141.C1596 ID5144.C1956 ID5141.C1906 ID5144.C2149 ID5141.C1221 ID5144.C1956 ID5141.C2149 ID5141.C2386 ID5141.C2039 ID5142.C1221 ID5141.C5887 ID5141.C7685 ID5141.C1005 ID5142.C2808 ID5141.C1046 ID5141.C1596 ID5141.C2386 ID5141.C4990 ID5141.C7685 ID5141.C4888
    Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.