comment on

I've provided a copy of the code below with a few comments. Some stuff you should consult the Perl documentation for - learning to do that is probably more valuable than anything I can convey in a short reply. Start at perl.

# Always use strictures!
use strict;
use warnings;

# Use nice names for variables. There is no such thing as saving time
# using short variable names! However, don't use excessively long 
# names either. Naming things is an important part of programming!
my %idLookup;
my %firstLines;

while (<DATA>) {
    chomp; # not chop! Remove line end from $_
    next unless length; # Skip blank lines

    my $pair = $_;

    # Pull out the id strings
    my ($first, $second) = map {s/.*\.//; $_} split ' ';

    # idLookup is a hash of arrays. Push one id onto the array
    # accessed using the other id as the hash key
    push @{$idLookup{$first}}, [$second, $pair];
    push @{$idLookup{$second}}, [$first, $pair];

    # Build a hash keyed by id containing line numbers ($. is current 
+line number
    $firstLines{$first} = $. unless exists $firstLines{$first};
    $firstLines{$second} = $. + 0.5 unless exists $firstLines{$second}
+;
}

# Build a list of ids ordered by line number
my @lineOrder = sort {$firstLines{$a} <=> $firstLines{$b}} keys %first
+Lines;
my $clusterNum;

# Generate clusters in order of first occurrence of ids
for my $id (@lineOrder) {
    # next if already allocated id to a cluster
    next unless exists $firstLines{$id}; 

    my %hits = ($id => undef); # hash of ids in current cluster to pro
+cess
    my @cluster;

    while (keys %hits) {
        my $hit = (keys %hits)[0]; # Get an id to process
        my $list = $idLookup{$hit}; # Get list of other ids current on
+e is matched with

        delete $hits{$hit}; # Remove current id from the hit list

        # next if already processed the id
        next unless exists $firstLines{$hit}; 
        delete $firstLines{$hit}; # Remove current id to indicate it's
+ processed

        for my $entry (@$list) {
            # Add matched ids to cluster
            my ($entryId, $pair) = @$entry;

            # next (don't add) if processed already
            next unless exists $firstLines{$entryId};
            push @cluster, $pair; # Add the original line to the clust
+er
            $hits{$entryId} = 1; # Add the matched id as a hit
        }
    }

    # print the cluster
    print "cluster", ++$clusterNum, "\n"; # Header line
    print join "\n", @cluster, "\n"; # Original lines
}
[download]

Perl's payment curve coincides with its learning curve.

In reply to Re^3: clustering pairs by GrandFather
in thread clustering pairs by sugar

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.