Here is my take on your problem, though I might have misinterpreted some things. I am assuming your record is in the following order:
  1. might or might not have a comment at start
  2. might or might not have that row of mostly numbers
  3. first part of right sequence
  4. might or might not have a ! preceded by optional whitespace
  5. first part of left sequence
  6. a blank line, followed by more parts of the right and left sequence
From this data you want to populate a hash, so that for each piece of data you have:
#$pkey is just a counter
$c->{$pkey}{left_instance}{sequence};
$c->{$pkey}{right_instance}{sequence};
$c->{$pkey}{comments};
#i am guessing $c->{$pkey}{match} as well but don't know
#what that does.
I am also guessing that, since you pass your hashref to _load_stats, you add further data from the "stats" line to your hash, so I added $c->{$pkey}{stats} as well. If nothing else, it might give you a different approach to work from. Again, keep in mind that I am working from your one example of a record, so some of the assumptions in my regular expressions will probably need tweaking. Anyhow, code follows:
use strict;
use warnings;
use Data::Dumper;

#records are separated by two blank lines
$/ = "\n\n\n";
my $pkey = 1;
my $c = {};
while ( <DATA> ) {
    #get comments if any
    $c->{$pkey}{comments} = $1 if s/^(#[^\n]+)\n//;
    #get stats if any
    _load_stats($pkey,$1,$c) if !/^Sbjct/ and s/\s*(\d[^\n]+)//;
    #loop over remaining data to get sequences
    while ( /Sbjct: ([-ACGT]+)\s+\d+\n(?:\s*!\s+)?Sbjct: ([-ACTG]+)/g) {
        $c->{$pkey}{left_instance}{sequence} .= $1;
        $c->{$pkey}{right_instance}{sequence} .= $2;
    }
    #add 1 to $pkey for next one
    #though we should probably be using an AoH
    #instead of a HoH so we don't need the counter
    #and can just use push.
    $pkey++;
}
#print resulting structure for verification
print Dumper $c;

#don't know what _load_stats really does
#so this will need changing
sub _load_stats {
    my ($pkey,$data,$href) = @_;
    $href->{$pkey}{stats} = [split ' ',$data];
} ##_load_stats
__DATA__
# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 60 /home/anonymous_monk/clusters/all/all_clusters
140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   393
       !
Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   394

Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   453
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   454

Sbjct: TTAAAATTCCCCCC-GGGGGG                                          474
       !
Sbjct: TTAAAATTCCCCCCGGGGGGG                                          475


170 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00
Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   393
       !
Sbjct: -CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   394

Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   453
Sbjct: TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   454

Sbjct: GTAAAATTCCCCCC-GGGGGG                                          474
       !
Sbjct: GTAAAATTCCCCCCGGGGGGG                                          475


# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 655 /home/anonymous_monk/clusters/all/all_clusters
Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   393
       !
Sbjct: -CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   394

Sbjct: CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   453
Sbjct: TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   454

Sbjct: CCAAAATTCCCCCC-GGGGGG                                          474
       !
Sbjct: CCAAAATTCCCCCCGGGGGGG                                          475
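
As the comment in the loop hints, the counter goes away if each record is pushed onto an array instead. Here is a rough sketch of that variant, under the same assumptions about the record layout. The sub name parse_records and the tiny input are made up for illustration, and the stats handling is still just a guess at what _load_stats should do:

```perl
use strict;
use warnings;
use Data::Dumper;

# Made-up miniature input in the same layout as the sample record:
# optional comment, optional stats line, then pairs of Sbjct lines.
my $input = join "\n",
    '# hypothetical comment line',
    '140 8778 333 D',
    'Sbjct: AACC   4',
    '       !',
    'Sbjct: -ACC   5',
    '',
    'Sbjct: GGTT   8',
    'Sbjct: GGTT   9',
    '',
    '',
    'Sbjct: TTTT   4',
    'Sbjct: TTTA   5',
    '';

# Returns a reference to an array of hashes, one hash per record.
sub parse_records {
    my ($text) = @_;
    my @records;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n\n\n";    # records are separated by two blank lines
    while ( my $rec = <$fh> ) {
        my %entry;
        # get comments if any
        $entry{comments} = $1 if $rec =~ s/^(#[^\n]+)\n//;
        # get stats if any (placeholder for the real _load_stats)
        $entry{stats} = [ split ' ', $1 ]
            if $rec !~ /^Sbjct/ and $rec =~ s/\s*(\d[^\n]+)//;
        # collect the sequence pairs, skipping an optional "!" line
        while ( $rec =~ /Sbjct: ([-ACGT]+)\s+\d+\n(?:\s*!\s+)?Sbjct: ([-ACGT]+)/g ) {
            $entry{left_instance}{sequence}  .= $1;
            $entry{right_instance}{sequence} .= $2;
        }
        push @records, \%entry;
    }
    close $fh;
    return \@records;
}

my $records = parse_records($input);
print Dumper $records;
```

Records then come back in file order, and $records->[0] is the first one, so nothing depends on keeping $pkey in step.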

-enlil


In reply to Re: More efficient? by Enlil
in thread Improve code to parse genetic record by Anonymous Monk
