Here's my take on it. I'm not sure why the keys within $c at the top of the routine are being used the way they are, so I guessed that you really only want $c to be a hashref keyed by $pkey values.
I factored the record processing out into its own routine for clarity. The way I would handle this is to keep a short buffer (here it's @cognate_rows) of the lines you expect to pair up, and process the pair as soon as the buffer holds two of them.
By the way, it's probably wise for you to investigate BioPerl -- I bet this is a standard format.
sub _parse_paired {
    my $this = shift;
    my $pkey = 1;
    # don't know why these keys are here...
    my $c = { comments       => '',
              left_instance  => '',
              right_instance => '',
              match          => '' };
    ### build up each record and place in the collection ###
    # local, so the separator is restored when this routine returns
    local $INPUT_RECORD_SEPARATOR = "\n\n\n";
    while (my $record = $this->{handle}->getline()) {
        my $rec_href = _build_record($record, $pkey);
        # key on the value of $pkey, not the literal string 'pkey'
        $c->{$pkey} = $rec_href;
        ++$pkey;
    }
    return $c;
}
#here's the routine I factored out:
sub _build_record {
    my ($record, $key) = @_;
    my %data = ();
    # keys will be left_sequence and right_sequence
    my @rows = split /\n/, $record;
    my (@cognate_rows);
    my $curr_cognate_matches = 1;
    while (@rows) {
        local $_ = shift @rows;
        chomp;
        if (/^\s*$/) {
            next; #skip blanks
        }
        if (/^\s+\d+/) {
            # you may have to adjust how _load_stats
            # works, or pass in $c to this routine.
            _load_stats($key, $_, \%data);
        }
        elsif (/^Sbjct/) {
            push @cognate_rows, $_;
            if (@cognate_rows == 2) {
                # we've found two rows that we expect to go together here.
                # if it matters, we know whether $curr_cognate_matches when we
                # reach this point
                my ($l, $r);
                (undef, $l, undef) = split /\s+/, $cognate_rows[0];
                (undef, $r, undef) = split /\s+/, $cognate_rows[1];
                $data{left_sequence}  .= $l;
                $data{right_sequence} .= $r;
                # reset match to true
                $curr_cognate_matches = 1;
                # dump the buffer
                @cognate_rows = ();
            }
        }
        elsif (/!/) {
            # we know the current @cognate_rows *don't* match
            $curr_cognate_matches = 0;
            next; #discard this line
        }
    } #end while rows
    return \%data;
}
The code is completely untested. It compiles under strict, provided you have "use English;" in effect (that's where $INPUT_RECORD_SEPARATOR comes from). That's as far as I've gone to check it.
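In case the English and record-separator parts are unfamiliar, here is a minimal, self-contained sketch of reading multi-line records that way. The sample text, the read_records name, and the in-memory filehandle are all made up for illustration; in your code the handle would be $this->{handle}, and your real input will look different.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);   # gives $INPUT_RECORD_SEPARATOR (alias for $/)
use IO::File;                     # makes the getline() method available on filehandles

# Hypothetical sample input: two records separated by two blank lines.
my $text = "Sbjct 1 ACGT 4\nSbjct 9 TTGA 12\n\n\nSbjct 5 GGCC 8\nSbjct 3 AATT 6\n";

# Hypothetical helper: read every record from a handle into a list.
sub read_records {
    my ($fh) = @_;
    # local: the previous separator is restored when this sub returns
    local $INPUT_RECORD_SEPARATOR = "\n\n\n";
    my @records;
    while (defined(my $record = $fh->getline())) {
        chomp $record;            # chomp strips the current separator, if present
        push @records, $record;
    }
    return @records;
}

open my $fh, '<', \$text or die "cannot open in-memory handle: $OS_ERROR";
my @records = read_records($fh);
close $fh;

printf "%d records\n", scalar @records;
```

Using local on the separator matters because it is a global: without it, every later read in the program would also be split on "\n\n\n".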