I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl.

# Always use strictures! use strict; use warnings; # Use nice names for variables. There is no such thing as saving time # using short variable names! However, don't use excessively long # names either. Naming things is an important part of programming! my %idLookup; my %firstLines; while (<DATA>) { chomp; # not chop! Remove line end from $_ next unless length; # Skip blank lines my $pair = $_; # Pull out the id strings my ($first, $second) = map {s/.*\.//; $_} split ' '; # idLookup is a hash of arrays. Push one id onto the array # accessed using the other id as the hash key push @{$idLookup{$first}}, [$second, $pair]; push @{$idLookup{$second}}, [$first, $pair]; # Build a hash keyed by id containing line numbers ($. is current +line number $firstLines{$first} = $. unless exists $firstLines{$first}; $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second} +; } # Build a list of ids ordered by line number my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first +Lines; my $clusterNum; # Generate clusters in order of first occurrence of ids for my $id (@lineOrder) { # next if already allocated id to a cluster next unless exists $firstLines{$id}; my %hits = ($id => undef); # hash of ids in current cluster to pro +cess my @cluster; while (keys %hits) { my $hit = (keys %hits)[0]; # Get an id to process my $list = $idLookup{$hit}; # Get list of other ids current on +e is matched with delete $hits{$hit}; # Remove current id from the hit list # next if already processed the id next unless exists $firstLines{$hit}; delete $firstLines{$hit}; # Remove current id to indicate it's + processed for my $entry (@$list) { # Add matched ids to cluster my ($entryId, $pair) = @$entry; # next (don't add) if processed already next unless exists $firstLines{$entryId}; push @cluster, $pair; # Add the original line to the clust +er $hits{$entryId} = 1; # Add the matched id as a hit } } # print the cluster print "cluster", ++$clusterNum, "\n"; # Header line print join "\n", @cluster, "\n"; # Original lines }

Perl's payment curve coincides with its learning curve.

In reply to Re^3: clustering pairs by GrandFather
in thread clustering pairs by sugar

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.