Hello Monks,

I'd like advice on how to parse the following example input *better* than the subroutine shown further down does. That is, I've already written working code to do this. The input "records" end with two "empty" lines and start with a row of mostly numbers preceded by whitespace (which you can't see as formatted).

example input (one record shown):

# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 60 /home/anonymous_monk/clusters/all/all_clusters
140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 393
!
Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 394
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 453
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 454
Sbjct: TTAAAATTCCCCCC-GGGGGG 474
!
Sbjct: TTAAAATTCCCCCCGGGGGGG 475

As you can see, the two sequences may or may not be separated by a line with an exclamation mark somewhere in it (in this case they are both at the beginning but that is not guaranteed), and in order to grab the sequence lines I need to account for both possibilities. This is why in my current approach I use a for loop that indexes through the rows by number, so that I can "look ahead" by one or two rows. You probably already know what bothers me about this: when I see a ! row or the row corresponding to the second sequence in the pair, I've already looked at that row at least once but am testing it anyway.

subroutine code:

sub _parse_paired {
    my $this = shift;
    my $pkey = 1;
    my $c = {
        comments       => '',
        left_instance  => '',
        right_instance => '',
        match          => ''
    };

    ### build up each record and place in the collection ###
    $INPUT_RECORD_SEPARATOR = "\n\n\n";
    while (my $record = $this->{handle}->getline()) {
        my ($lt, $rt, $lseq, $rseq) = ();
        my @rows = split /\n/, $record;
        for (my $i = 0; $i < $#rows; $i++) {
            if ($rows[$i] =~ m/^\n?$/) {
                next;
            }
            elsif ($rows[$i] =~ m/^#/) {
                $c->{$pkey}->{comments} .= "$rows[$i]\n";
            }
            elsif ($rows[$i] =~ m/^\s+\d+/) {
                _load_stats($pkey, $rows[$i], $c);
            }
            elsif ($rows[$i] =~ m/^Sbjct/ && $rows[$i+1] =~ m/^Sbjct/) {
                (undef, $lt, undef) = split /\s+/, $rows[$i];
                (undef, $rt, undef) = split /\s+/, $rows[$i+1];
                $lseq .= $lt;
                $rseq .= $rt;
            }
            elsif ($rows[$i] =~ m/^Sbjct/ && $rows[$i+1] =~ m/!/) {
                (undef, $lt, undef) = split /\s+/, $rows[$i];
                (undef, $rt, undef) = split /\s+/, $rows[$i+2];
                $lseq .= $lt;
                $rseq .= $rt;
            }
        }
        $c->{$pkey}->{left_instance}->{sequence}  = $lseq;
        $c->{$pkey}->{right_instance}->{sequence} = $rseq;
        ++$pkey;
    }
    return $c;
}

Many thanks for any and all constructive advice.

2003-04-30 edit ybiC: <tt> tags around example input record for legibility, <readmore> tags for frontpage space conservation

Edit by tye, TT -> CODE (and remove BR) so extra spaces show up

2003-05-01 edit ybiC: retitle from "More efficient?"


In reply to Improve code to parse genetic record by Anonymous Monk
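
One way to drop the look-ahead, offered as a minimal sketch rather than a drop-in replacement: walk the rows once and remember the left-hand fragment until its partner shows up. Rows that do not start with Sbjct: (blank rows, # comments, the stats row, ! marker rows) are skipped, so no row is ever tested against its neighbour. The name collect_pair_sequences is made up for this example, and the sketch assumes that Sbjct: rows always arrive in left/right order and that a ! row never begins with Sbjct:.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): build the left and
# right sequences from the rows of one record in a single pass.
# Assumption: Sbjct: rows alternate left, right, left, right, ...
sub collect_pair_sequences {
    my @rows = @_;
    my ($lseq, $rseq) = ('', '');
    my $pending_left;    # left fragment still waiting for its partner

    for my $row (@rows) {
        # Only Sbjct: rows carry sequence data; everything else is skipped.
        my ($fragment) = $row =~ /^Sbjct:\s+(\S+)/;
        next unless defined $fragment;

        if (defined $pending_left) {
            # Second Sbjct: row of the pair, so commit both halves.
            $lseq .= $pending_left;
            $rseq .= $fragment;
            undef $pending_left;
        }
        else {
            # First Sbjct: row of the pair, so hold it for now.
            $pending_left = $fragment;
        }
    }
    return ($lseq, $rseq);
}

# Usage with shortened, made-up rows in the same layout as the example input.
my @rows = (
    'Sbjct: AAAA 393',
    '!',
    'Sbjct: -AAA 394',
    'Sbjct: GGCC 453',
    'Sbjct: GGCC 454',
);
my ($left, $right) = collect_pair_sequences(@rows);
print "$left\n$right\n";    # prints AAAAGGCC, then -AAAGGCC
```

Inside _parse_paired something like this could stand in for the two Sbjct branches, while the # and stats branches stay as they are. If the left/right ordering assumption does not hold for your data, the look-ahead version is the safer one.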
