Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I request advice on how I might approach parsing the following example input *better* than the subroutine which will follow. That is, I've already written working code to do this. The input "records" end with 2 "empty" lines, and start with a row of mostly numbers preceded by whitespace (which you can't see as formatted).
example input (one record shown):
    # args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 60 /home/anonymous_monk/clusters/all/all_clusters
       140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00
    Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 393
    !
    Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 394
    Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 453
    Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 454
    Sbjct: TTAAAATTCCCCCC-GGGGGG 474
    !
    Sbjct: TTAAAATTCCCCCCGGGGGGG 475
As you can see, the two sequences may or may not be separated by a line with an exclamation mark somewhere in it (in this case they are both at the beginning but that is not guaranteed), and in order to grab the sequence lines I need to account for both possibilities. This is why in my current approach I use a for loop that indexes through the rows by number, so that I can "look ahead" by one or two rows. You probably already know what bothers me about this: when I see a ! row or the row corresponding to the second sequence in the pair, I've already looked at that row at least once but am testing it anyway.
subroutine code:
    sub _parse_paired {
        my $this = shift;
        my $pkey = 1;
        my $c = { comments => '', left_instance => '', right_instance => '', match => '' };

        ### build up each record and place in the collection ###
        $INPUT_RECORD_SEPARATOR = "\n\n\n";
        while (my $record = $this->{handle}->getline()) {
            my ($lt, $rt, $lseq, $rseq) = ();
            my @rows = split /\n/, $record;
            for (my $i = 0; $i < $#rows; $i++) {
                if ($rows[$i] =~ m/^\n?$/) {
                    next;
                }
                elsif ($rows[$i] =~ m/^#/) {
                    $c->{$pkey}->{comments} .= "$rows[$i]\n";
                }
                elsif ($rows[$i] =~ m/^\s+\d+/) {
                    _load_stats($pkey, $rows[$i], $c);
                }
                elsif ($rows[$i] =~ m/^Sbjct/ && $rows[$i+1] =~ m/^Sbjct/) {
                    (undef, $lt, undef) = split /\s+/, $rows[$i];
                    (undef, $rt, undef) = split /\s+/, $rows[$i+1];
                    $lseq .= $lt;
                    $rseq .= $rt;
                }
                elsif ($rows[$i] =~ m/^Sbjct/ && $rows[$i+1] =~ m/!/) {
                    (undef, $lt, undef) = split /\s+/, $rows[$i];
                    (undef, $rt, undef) = split /\s+/, $rows[$i+2];
                    $lseq .= $lt;
                    $rseq .= $rt;
                }
            }
            $c->{$pkey}->{left_instance}->{sequence}  = $lseq;
            $c->{$pkey}->{right_instance}->{sequence} = $rseq;
            ++$pkey;
        }
        return $c;
    }
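One way to avoid the look-ahead, and with it the repeated testing of the same row, is to remember the first Sbjct line of a pair and commit both halves when the second one turns up. The following is only a minimal sketch of that idea, not a drop-in replacement: `parse_record` and its flat hash layout are hypothetical names made up for illustration, it assumes Sbjct lines always arrive in left/right pairs, and it assumes a `!` row never carries sequence data of its own. The demo record at the bottom is invented and much shorter than real input.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post above): gather the left and right
# sequences of one record in a single pass, touching each row exactly once.
sub parse_record {
    my ($record) = @_;
    my %rec = (comments => '', stats => '', left => '', right => '');
    my $pending_left;    # left fragment still waiting for its partner

    for my $row (split /\n/, $record) {
        if ($row =~ /^#/) {
            $rec{comments} .= "$row\n";
        }
        elsif ($row =~ /^\s+\d/) {
            # the posted code passes this row to _load_stats; here it is just kept
            $rec{stats} = $row;
        }
        elsif ($row =~ /^Sbjct:\s+(\S+)/) {
            if (defined $pending_left) {
                # second Sbjct line of the pair: commit both halves
                $rec{left}  .= $pending_left;
                $rec{right} .= $1;
                undef $pending_left;
            }
            else {
                # first Sbjct line of the pair: hold it until the partner shows up
                $pending_left = $1;
            }
        }
        # blank rows and "!" rows match nothing above and are simply skipped
    }
    return \%rec;
}

# Invented demo record (assumption: same row layout as the example input).
my $demo = "# args=-demo\n   1 2 3 D\n"
         . "Sbjct: AAAA 4\n!\nSbjct: -AAA 5\n\n"
         . "Sbjct: CCGG 8\nSbjct: CCGG 9\n";
my $rec = parse_record($demo);
print "$rec->{left}\n$rec->{right}\n";    # prints AAAACCGG then -AAACCGG
```

Because the `!` row is never inspected, it makes no difference whether it is present, where the exclamation mark sits, or whether blank rows separate the pairs. The trade-off is that a record with an odd number of Sbjct lines silently drops the dangling fragment, so a real version would probably want to warn when `$pending_left` is still defined at the end of the loop.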
Many thanks for any and all constructive advice.
2003-04-30 edit ybiC: <tt> tags around example input record for legibility, <readmore> tags for frontpage space conservation
Edit by tye, TT -> CODE (and remove BR) so extra spaces show up
2003-05-01 edit ybiC: retitle from "More efficient?"
Replies are listed 'Best First'.

Re: More efficient?
by jkahn (Friar) on Apr 30, 2003 at 20:19 UTC
by Anonymous Monk on Apr 30, 2003 at 22:58 UTC

Re: More efficient?
by BrowserUk (Patriarch) on Apr 30, 2003 at 22:28 UTC

Re: More efficient?
by Enlil (Parson) on Apr 30, 2003 at 21:28 UTC

Re: More efficient?
by tall_man (Parson) on Apr 30, 2003 at 20:24 UTC
by Anonymous Monk on Apr 30, 2003 at 23:00 UTC