in reply to Re: clustering pairs
in thread clustering pairs

genius :) still i am trying to understand your program. Its very complicated.if you dont mind can u include comments on how it works. so that, i can learn, understand and also be able to modify the program to add features if needed later. Thank you so much for your guidance. hashes can do wonders i guess. i should start learning them soon :)

Replies are listed 'Best First'.
Re^3: clustering pairs
by GrandFather (Saint) on Dec 01, 2008 at 08:38 UTC

    I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl.

    # Always use strictures! use strict; use warnings; # Use nice names for variables. There is no such thing as saving time # using short variable names! However, don't use excessively long # names either. Naming things is an important part of programming! my %idLookup; my %firstLines; while (<DATA>) { chomp; # not chop! Remove line end from $_ next unless length; # Skip blank lines my $pair = $_; # Pull out the id strings my ($first, $second) = map {s/.*\.//; $_} split ' '; # idLookup is a hash of arrays. Push one id onto the array # accessed using the other id as the hash key push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; # Build a hash keyed by id containing line numbers ($. is current +line number $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } # Build a list of ids ordered by line number my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; # Generate clusters in order of first occurrence of ids for my $id (@lineOrder) { # next if already allocated id to a cluster next unless exists $firstLines{$id}; my %hits = ($id => undef); # hash of ids in current cluster to pro +cess my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; # Get an id to process my $list = $idLookup{$hit}; # Get list of other ids current on +e is matched with delete $hits{$hit}; # Remove current id from the hit list # next if already processed the id next unless exists $firstLines{$hit}; delete $firstLines{$hit}; # Remove current id to indicate it's + processed for my $entry (@$list) { # Add matched ids to cluster my ($entryId, $pair) = @$entry; # next (don't add) if processed already next unless exists $firstLines{$entryId}; push @cluster, $pair; # Add the original line to the clust +er $hits{$entryId} = 1; # Add the matched id as a hit } } # print the cluster print "cluster", ++$clusterNum, "\n"; # Header line print join "\n", @cluster, "\n"; # Original lines }

    Perl's payment curve coincides with its learning curve.
      Thanks a lot, that was of a great help with the comments explaining in detail...Just thought, y not possible with arrays. Check out the program below :)
      #!/usr/bin/perl open FILE,"sampledata"; @arr = <FILE>; chomp @arr; close(FILE); local $,="\n"; while(@arr) { my @reslt; @str = shift @arr; push(@reslt,$str[0]); while (@str) { $flag = 0; $str = shift @str; $s1,$s2,$flag) = split(/ /,$str); my $count = -1; my $acount = 0; #to arrange o/p foreach(@arr) { $count++; if($_ =~ /$s1|$s2/) { $acount++; if($acount == 2 || $flag == 1) { unshift(@reslt,$_); unshift(@str,$_." 1"); } else { push(@reslt,$_); push(@str,$_); } splice(@arr,$count,1); } } } print @reslt,"\n"; }

        Sadly there are a few issues with that code. Some are transcription errors, some are coding convention related, some are bugs and one is at odds with a PerlMonks' convention.

        First the PerlMonks' convention: runnable stand alone code. By removing the file handling and using DATA instead it's easy to make your code stand alone.

        Coding conventions: always use strictures (use strict; use warnings;). Use a consistent indentation style (Perl tends toward K&R with 4 character indents). Use the three parameter version of open and test for errors (open ... or die "... $!\n" by convention). Don't slurp (my @arr = <DATA>;). Use blank lines to break your code up into "paragraphs". Comment tricky stuff (your use of $flag and the appended 1 for example).

        Bugs: it doesn't work! I get two one row clusters then everything else in one cluster. With strictures on there are "Use of uninitialized value" warnings.


        Perl's payment curve coincides with its learning curve.