
Hello Monks,

I'd like advice on how to parse the example input below *better* than the subroutine that follows it. I already have working code, so this is about improving the approach. Each input "record" ends with two "empty" lines and starts with a row of mostly numbers preceded by whitespace (which you can't see as formatted).

example input (one record shown):

# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 60 /home/anonymous_monk/clusters/all/all_clusters
   140     8778      333   D   140     8778      334  -7    3.60e-56   -259    95.00
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA         393
       !    
                                                       
Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   
+      394

Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT         453
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT         454

Sbjct: TTAAAATTCCCCCC-GGGGGG                                                474
                     !      
Sbjct: TTAAAATTCCCCCCGGGGGGG                                                475

As you can see, the two sequences may or may not be separated by a line with an exclamation mark somewhere in it (in this case they are both at the beginning but that is not guaranteed), and in order to grab the sequence lines I need to account for both possibilities. This is why in my current approach I use a for loop that indexes through the rows by number, so that I can "look ahead" by one or two rows. You probably already know what bothers me about this: when I see a ! row or the row corresponding to the second sequence in the pair, I've already looked at that row at least once but am testing it anyway.

subroutine code:

sub _parse_paired {
    my $this = shift;
    my $pkey = 1;
    my $c = { comments => '',
          left_instance => '',
          right_instance => '',
          match => '' };    
    
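    ### build up each record and place in the collection ###
    # Records end with two empty lines. $INPUT_RECORD_SEPARATOR needs
    # "use English;" in scope; local restores the old value on return.
    local $INPUT_RECORD_SEPARATOR = "\n\n\n";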
    while (my $record = $this->{handle}->getline()) {
        my ($lt, $rt, $lseq, $rseq) = ();
        my @rows = split /\n/, $record;
        for (my $i = 0; $i < $#rows; $i++) {
            if ($rows[$i] =~ m/^\n?$/) { next; }
            elsif ($rows[$i] =~ m/^#/) {
                $c->{$pkey}->{comments} .= "$rows[$i]\n";
            }
            elsif ($rows[$i] =~ m/^\s+\d+/) {
                _load_stats($pkey, $rows[$i], $c);
            }
            elsif ($rows[$i] =~ m/^Sbjct/ &&
                 $rows[$i+1] =~ m/^Sbjct/) {
                (undef, $lt, undef) = split /\s+/, $rows[$i];
                (undef, $rt, undef) = split /\s+/, $rows[$i+1];
                $lseq .= $lt; $rseq .= $rt;
            }
            elsif ($rows[$i] =~ m/^Sbjct/ &&
                 $rows[$i+1] =~ m/!/) {
                (undef, $lt, undef) = split /\s+/, $rows[$i];
                (undef, $rt, undef) = split /\s+/, $rows[$i+2];
                $lseq .= $lt; $rseq .= $rt;
            }
        }
        $c->{$pkey}->{left_instance}->{sequence} = $lseq;
        $c->{$pkey}->{right_instance}->{sequence} = $rseq;
        ++$pkey;
    }

    return $c;
}
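
For comparison, here is a minimal sketch of one way to drop the look-ahead entirely. It assumes that within a record the Sbjct lines strictly alternate left, right, so the optional ! row never has to be inspected. The name pair_sequences and the tiny test record are hypothetical and not part of the module above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): takes one record as a
# string and returns the concatenated left and right sequences.
# Assumption: Sbjct lines strictly alternate left, right within a record.
sub pair_sequences {
    my ($record) = @_;
    my @seq  = ('', '');   # [0] = left sequence, [1] = right sequence
    my $side = 0;          # which sequence the next Sbjct line belongs to

    for my $row (split /\n/, $record) {
        # Only Sbjct rows are used; "!" rows, blank rows, comments and the
        # stats row are skipped here and would be handled elsewhere.
        next unless $row =~ /^Sbjct:\s+(\S+)/;
        $seq[$side] .= $1;
        $side = 1 - $side; # toggle between left and right
    }
    return @seq;
}

# Made-up miniature record: one pair with a "!" row, one pair without.
my $record = join "\n",
    'Sbjct: AAAA   393',
    '       !',
    'Sbjct: -AAA   394',
    '',
    'Sbjct: GGCC   453',
    'Sbjct: GGCC   454';

my ($left, $right) = pair_sequences($record);
print "$left\n$right\n";
```

Each row is matched exactly once, so no row is tested twice. The trade-off is that a record with an odd number of Sbjct lines would silently misalign, so a real version would probably want a sanity check on the count.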

Many thanks for any and all constructive advice.

2003-04-30 edit ybiC: <tt> tags around example input record for legibility, <readmore> tags for frontpage space conservation

Edit by tye, TT -> CODE (and remove BR) so extra spaces show up

2003-05-01 edit ybiC: retitle from "More efficient?"

In reply to Improve code to parse genetic record by Anonymous Monk
