tybalt89's solution processes an input file line-by-line and so has the advantage that it will scale to an input file of any size (well, as long as your HD will hold both the input and output files :).

It seems to me to have the disadvantage of... terseness, shall we say? Let me offer an alternative that is line-by-line and that also:

So, FWIW:
Script:
use warnings; use strict; use autodie; use Test::More 'no_plan'; use Test::NoWarnings; use Data::Dump qw(dd); use constant EXPECTED => <<EOT; Osat_a Atha_b, Mtru_c Fves_d, Osat_e, Atha_f Atha_g Osat_h, Atha_i Mtru_j EOT open my $fh_out, '>', \ my $output; # simulate file i/o for testing # regexes of cluster representative and member record fields. use constant RX_REP => qr{ \b [[:upper:]] [[:lower:]]+ _ [[:lower:]] \ +b }xms; use constant RX_MBR => qr{ \b [[:upper:]] [[:lower:]]+ _ [[:lower:]] \ +b }xms; use constant SEP => ', '; # output record separator use constant TERM => "\n"; # output record terminator # WARNING: the empty string SHALL NOT be a possible cluster representa +tive. my $previous_rep = ''; RECORD: while (my $record = <DATA>) { # ignore... next RECORD unless $record =~ m{ \S }xms; # blank lines my $parsed = my ($rep, $member) = $record =~ m{ \A \s* (${ \RX_REP }) \s+ # representative (${ \RX_MBR }) \s* # member (?: [#] [^\n]*)? # optional comment \Z }xmsg; die "bad record: '$record'" unless $parsed; my $new_cluster_begins = $previous_rep ne $rep; if ($new_cluster_begins) { # at start of each new cluster, cluster representative # must be same as cluster member. $rep eq $member or die # just checking "representative '$rep' not same as member '$member' ", "at start of new cluster '$record'"; # mark new cluster. $previous_rep = $rep; # terminate current cluster, if any; begin new one. terminate_cluster($fh_out); begin_new_cluster($fh_out, $rep); # representative/member same in new cluster: ignore append. next RECORD; } # not start of new cluster: append latest member to output. append_to_cluster($fh_out, $member); } # terminate final cluster, if any. terminate_cluster($fh_out); is $output, EXPECTED, "test output"; done_testing; close $fh_out; exit; # subroutines ###################################################### { # begin function closure my $begun; # private: output has begun; initial value false # begin new cluster. sub begin_new_cluster { my ($fh, # file handle: output stream $representative, # str: representative ) = @_; $begun = # if print succeeds, we've begun print $fh $representative; } # append latest member to output record. sub append_to_cluster { my ($fh, # file handle: output stream $member, # str: member ) = @_; print $fh SEP, $member; } # terminate current cluster, if any. sub terminate_cluster { my ($fh, # file handle: output stream ) = @_; return unless $begun; # output not begun yet: do nothing print $fh TERM; } } # end function closure __DATA__ Osat_a Osat_a # just one cluster member Atha_b Atha_b # >1 cluster member, this & next line = 2 members Atha_b Mtru_c Fves_d Fves_d # this & next 2 lines = 3 cluster members Fves_d Osat_e Fves_d Atha_f Atha_g Atha_g # just 1 cluster member Osat_h Osat_h Osat_h Atha_i Mtru_j Mtru_j # just 1 cluster member
Output:
c:\@Work\Perl\monks\onlyIDleft>perl process_cluster_info_3.pl ok 1 - test output 1..1 ok 2 - no warnings 1..2


Give a man a fish:  <%-{-{-{-<


In reply to Re: Processing while reading in input by AnomalousMonk
in thread Processing while reading in input by onlyIDleft

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.