in reply to Improve code to parse genetic record

Here is my take on your problem, though I might have misinterpreted some things. I am assuming your record is in the following order:
  1. might or might not have a comment at start
  2. might or might not have that row of mostly numbers
  3. first part of right sequence
  4. might or might not have a ! preceded by optional whitespace
  5. first part of left sequence
  6. a blank line, followed by more parts of the right and left sequence
From this data you want to populate a hash with the information, so that for each piece of data you have:
# $pkey is just a counter
$c->{$pkey}{left_instance}{sequence};
$c->{$pkey}{right_instance}{sequence};
$c->{$pkey}{comments};
# I am guessing $c->{$pkey}{match} as well, but I don't know
# what that does.
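To make the shape of that hash concrete, here is a minimal sketch of one populated entry and how you would read it back. The sequences and the comment text are shortened, made-up values, not taken from your data.

```perl
use strict;
use warnings;

# Hypothetical, shortened values purely for illustration.
my $c = {
    1 => {
        comments       => '# example comment line',
        left_instance  => { sequence => 'AAAATTCC' },
        right_instance => { sequence => '-AAATTCC' },
    },
};

# Walk the records in numeric key order and pull out each piece.
for my $pkey ( sort { $a <=> $b } keys %$c ) {
    my $rec = $c->{$pkey};
    printf "%d left : %s\n", $pkey, $rec->{left_instance}{sequence};
    printf "%d right: %s\n", $pkey, $rec->{right_instance}{sequence};
    printf "%d note : %s\n", $pkey, $rec->{comments} // '(none)';
}
```

The `//` (defined-or) operator needs Perl 5.10 or later; on older perls use `defined` with a ternary instead.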
I am also guessing that, since you pass your hashref to _load_stats, you add further data from the "stats" line into your hash, so I added $c->{$pkey}{stats} as well. If nothing else, it might give you a different approach to work from. Again, keep in mind that I am working from your one example of a record, so some of the assumptions in my regular expressions will probably need tweaking. Anyhow, code follows:
use strict;
use warnings;
use Data::Dumper;

$/ = "\n\n\n";
my $pkey = 1;
my $c = {};

while ( <DATA> ) {
    #get comments if any
    $c->{$pkey}{comments} = $1 if s/^(#[^\n]+)\n//;

    #get stats if any
    _load_stats($pkey,$1,$c) if !/^Sbjct/ and s/\s*(\d[^\n]+)//;

    #loop over remaining data to get sequences
    while ( /Sbjct: ([-ACGT]+)\s+\d+\n(?:\s*!\s+)?Sbjct: ([-ACTG]+)/g) {
        $c->{$pkey}{left_instance}{sequence} .= $1;
        $c->{$pkey}{right_instance}{sequence} .= $2;
    }

    #add 1 to $pkey for next one
    #though we should probably be using an AoH
    #instead of a HoH so we don't need the counter
    #and can just use push.
    $pkey++;
}

#print resulting structure for verification
print Dumper $c;

#don't know what _load_stats really does
#so this will need changing
sub _load_stats {
    my ($pkey,$data,$href) = @_;
    $href->{$pkey}{stats} = [split ' ',$data];
} ##_load_stats

__DATA__
# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 60 /home/anonymous_monk/clusters/all/all_clusters
140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 393
       !
Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 394

Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 453
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 454

Sbjct: TTAAAATTCCCCCC-GGGGGG 474
                     !
Sbjct: TTAAAATTCCCCCCGGGGGGG 475


170 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00
Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 393
       !
Sbjct: -CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 394

Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 453
Sbjct: TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 454

Sbjct: GTAAAATTCCCCCC-GGGGGG 474
                     !
Sbjct: GTAAAATTCCCCCCGGGGGGG 475


# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 655 /home/anonymous_monk/clusters/all/all_clusters
Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 393
       !
Sbjct: -CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 394

Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 453
Sbjct: TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 454

Sbjct: CCAAAATTCCCCCC-GGGGGG 474
                     !
Sbjct: CCAAAATTCCCCCCGGGGGGG 475
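The comment in the code about using an array of hashes instead of a counter could look roughly like the sketch below. It is only an assumption about how you might restructure things: parse_records is a name I made up, the two records are tiny invented ones, and I left out the stats handling since I don't know what _load_stats really does. The first captured sequence still goes to left_instance, as in the code above.

```perl
use strict;
use warnings;

# Two tiny made-up records, separated by two blank lines.
my $data = <<'END';
# first record
Sbjct: ACGT 4
   !
Sbjct: -CGT 5

Sbjct: GGCC 8
Sbjct: GGCC 9


Sbjct: TTAA 4
Sbjct: TTAA 5
END

# Hypothetical helper: returns an array ref with one hash per record.
sub parse_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n\n\n";    # one record per read
    my @records;
    while ( my $chunk = <$fh> ) {
        my %rec;
        # optional leading comment line
        $rec{comments} = $1 if $chunk =~ s/^(#[^\n]+)\n//;
        # each pair of Sbjct lines, with an optional "!" line in between
        while ( $chunk =~ /Sbjct: ([-ACGT]+)\s+\d+\n(?:\s*!\s+)?Sbjct: ([-ACGT]+)/g ) {
            $rec{left_instance}{sequence}  .= $1;
            $rec{right_instance}{sequence} .= $2;
        }
        push @records, \%rec;    # no counter needed
    }
    close $fh;
    return \@records;
}

my $records = parse_records($data);
printf "%d records\n", scalar @$records;
```

With this layout a record is $records->[0] rather than $c->{1}, and the order of the records in the file is kept without sorting keys.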

-enlil