Improve code to parse genetic record

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I request advice on how I might approach parsing the following example input *better* than the subroutine which will follow. That is, I've already written working code to do this. The input "records" end with 2 "empty" lines, and start with a row of mostly numbers preceded by whitespace (which you can't see as formatted).

example input (one record shown):

# args=-supermax -d -l 50 -h 7 -seedlength 7 -evalue 0.0001 -s 60 /hom
+e/anonymous_monk/clusters/all/all_clusters
   140     8778      333   D   140     8778      334  -7    3.60e-56  
+ -259    95.00
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   
+      393
       !    
                                                       
Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   
+      394

Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   
+      453
Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT   
+      454

Sbjct: TTAAAATTCCCCCC-GGGGGG                                          
+      474
                     !      
Sbjct: TTAAAATTCCCCCCGGGGGGG                                          
+      475
[download]

As you can see, the two sequences may or may not be separated by a line with an exclamation mark somewhere in it (in this case they are both at the beginning but that is not guaranteed), and in order to grab the sequence lines I need to account for both possibilities. This is why in my current approach I use a for loop that indexes through the rows by number, so that I can "look ahead" by one or two rows. You probably already know what bothers me about this: when I see a ! row or the row corresponding to the second sequence in the pair, I've already looked at that row at least once but am testing it anyway.

subroutine code:

sub _parse_paired {
    my $this = shift;
    my $pkey = 1;
    my $c = { comments => '',
          left_instance => '',
          right_instance => '',
          match => '' };    
    
    ### build up each record and place in the collection ###
    $INPUT_RECORD_SEPARATOR = "\n\n\n";
    while (my $record = $this->{handle}->getline()) {
        my ($lt, $rt, $lseq, $rseq) = ();
        my @rows = split /\n/, $record;
        for (my $i = 0; $i < $#rows; $i++) {
            if ($rows[$i] =~ m/^\n?$/) { next; }
            elsif ($rows[$i] =~ m/^#/) {
                $c->{$pkey}->{comments} .= "$rows[$i]\n";
            }
            elsif ($rows[$i] =~ m/^\s+\d+/) {
                _load_stats($pkey, $rows[$i], $c);
            }
            elsif ($rows[$i] =~ m/^Sbjct/ &&
                 $rows[$i+1] =~ m/^Sbjct/) {
                (undef, $lt, undef) = split /\s+/, $rows[$i];
                (undef, $rt, undef) = split /\s+/, $rows[$i+1];
                $lseq .= $lt; $rseq .= $rt;
            }
            elsif ($rows[$i] =~ m/^Sbjct/ &&
                 $rows[$i+1] =~ m/!/) {
                (undef, $lt, undef) = split /\s+/, $rows[$i];
                (undef, $rt, undef) = split /\s+/, $rows[$i+2];
                $lseq .= $lt; $rseq .= $rt;
            }
        }
        $c->{$pkey}->{left_instance}->{sequence} = $lseq;
        $c->{$pkey}->{right_instance}->{sequence} = $rseq;
        ++$pkey;
    }

    return $c;
}
[download]

Many thanks for any and all constructive advice.

2003-04-30 edit ybiC: <tt> tags around example input record for legibility, <readmore> tags for frontpage space conservation

Edit by tye, TT -> CODE (and remove BR) so extra spaces show up

2003-05-01 edit ybiC: retitle from "More efficient?"

Comment on Improve code to parse genetic record Select or Download Code

Replies are listed 'Best First'.
Re: More efficient? by jkahn (Friar) on Apr 30, 2003 at 20:19 UTC
Here's my take on it. I'm not sure why the keys within `$c` at the top of the routine are being used the way they are, so I guessed that you really only want `$c` to be a ~~hash~~ hashref of `$pkey` values. I separated out the record processing into a separate routine for clarity. The way I would handle this is to keep a short buffer (here it's `@cognate_rows`) of the strings you expect to pair up. By the way, it's probably wise for you to investigate bioperl -- I bet this is a standard format. sub _parse_paired { my $this = shift; my $pkey = 1; # don't know why these keys are here... my $c = { comments => '', left_instance => '', right_instance => '', match => '' }; ### build up each record and place in the collection ### $INPUT_RECORD_SEPARATOR = "\n\n\n"; while (my $record = $this->{handle}->getline()) { my $rec_href = _build_record($record, $pkey); $c->{pkey} = $rec_href; ++$pkey; } return $c; } #here's the routine I factored out: sub _build_record { my ($record, $key) = (@_); my %data = (); # keys will be left_sequence and right_sequence my @rows = split /\n/, $record; my (@cognate_rows); my $curr_cognate_matches = 1; while (@rows) { local $_ = shift @rows; chomp; if (/^\s$/) { next; #skip blanks } if (/^\s+\d+/) { # you may have to adjust how _load_stats # works, or pass in $c to this routine. _load_stats($key, $_, \%data); } elsif (/^Sbjct/) { push @cognate_rows, $_; if (@cognate_rows == 2) { # we've found two rows that we expect to go together here. # if it matters, we know whether $curr_cognate_matches when we # reach this point my ($l, $r); (undef, $l, undef) = split /\s+/, $cognate_rows[0]; (undef, $r, undef) = split /\s+/, $cognate_rows[1]; $data{left_sequence} .= $l; $data{right_sequence} .= $r; # reset match to true $curr_cognate_matches = 1; # dump the buffer @cognate_rows = (); } } elsif (/!/) { # we know the current @cognate_rows don't* match $curr_cognate_matches = 0; next; #discard this line } } #end while rows return \%data; } [download] Code is completely untested. It compiles, under strict, provided you're using `English`. That's as far as I've gone to check this.	[reply] [d/l] [select]
Re: Re: More efficient? by Anonymous Monk on Apr 30, 2003 at 22:58 UTC
"By the way, it's probably wise for you to investigate bioperl -- I bet this is a standard format." It isn't. In fact when I asked the bioperl list if a parser exists for this tool, and if they would be interested in one, I got no answer.	[reply]
Re: More efficient? by BrowserUk (Patriarch) on Apr 30, 2003 at 22:28 UTC
Whether this will prove more efficient is speculative, but as the entire record is parsed in a single call to the regex engine, I think ought to be. It requires 5.8 because of the use of $^N which wasn't available before. I haven't wrapped it up as a function, I'm putting the gathered statistics into the hash for later processing and I've guessed as to where comment cards are possible, probably incorrectly. The sequence of regexes may look scarry, but they really aren't if you take the time to look at them and understand what they are doing. Many here are only too willing to explain anything that you don't follow, including myself. I could have littered the regexes with comments, but that just seems to make them look more complicated (to me). The essence of the technique used here is to follow each capture group with a `{?{ $var .= $^N })` code block, to assign or append the captured data as the regex progresses. This greatly simplifies the situation where you have optional capture groups in the regex, which means that you are never sure which part has been captured into which $n variable. #! perl -slw use strict; require 5.008; use Data::Dumper; use re 'eval'; my $re_comment = qr[^ \# ( \s* [^\n]+ )(?{ $c->{comment} .= $^N; })]x; my $re_stats = qr[^ \s+ ( \d+ [^\n]+ ) \n]x; my $re_seq = qr[^ Sbjct: \s+ ( [^ ]+ ) \s+ \d+ \n]x; my $re_bang = qr[^ \s* ! \s* \n]x; my $re_pair = qr[ $re_seq (?{ $c->{left_instance} .= $^N; }) (?:$re_bang)? $re_seq (?{ $c->{right_instance} .= $^N; }) $re_comment? \n ]x; my $re_record= qr[ $re_stats (?{ $c->{stats} .= $^N }) (?: $re_pair )+ ]x; local $/ = "\n\n\n"; while( <DATA> ) { our $c = {}; print m[$re_record]mo ? Dumper $c : 'Failed to match'; } __DATA__ 140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 39 +3 ! Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 39 +4 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 45 +3 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 45 +4 Sbjct: TTAAAATTCCCCCC-GGGGGG 474 ! Sbjct: TTAAAATTCCCCCCGGGGGGG 475 140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 39 +3 ! Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 39 +4 # some comment Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 45 +3 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 45 +4 # some more comment Sbjct: TTAAAATTCCCCCC-GGGGGG 474 ! Sbjct: TTAAAATTCCCCCCGGGGGGG 475 140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95.00 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 39 +3 ! Sbjct: -AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 39 +4 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 45 +3 Sbjct: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGCCTT 45 +4 Sbjct: TTAAAATTCCCCCC-GGGGGG 474 ! Sbjct: TTAAAATTCCCCCCGGGGGGG 475 [download] Ouput D:\Perl\test>254442 $VAR1 = { 'right_instance' => '-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG +GGGGGGGGCCTTTTAAAATTCCCCCCGGGGGGG', 'stats' => '140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95. +00', 'left_instance' => 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG +GGGGGGGCCTTTTAAAATTCCCCCC-GGGGGG' }; $VAR1 = { 'comment' => ' some comment some more comment', 'right_instance' => '-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG +GGGGGGGGCCTTTTAAAATTCCCCCCGGGGGGG', 'stats' => '140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95. +00', 'left_instance' => 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG +GGGGGGGCCTTTTAAAATTCCCCCC-GGGGGG' }; $VAR1 = { 'right_instance' => '-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG +GGGGGGGGCCTTTTAAAATTCCCCCCGGGGGGG', 'stats' => '140 8778 333 D 140 8778 334 -7 3.60e-56 -259 95. +00', 'left_instance' => 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG +GGGGGGGCCTTTTAAAATTCCCCCC-GGGGGG' }; [download] Examine what is said, not who speaks. 1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong. 2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible 3) Any sufficiently advanced technology is indistinguishable from magic. Arthur C. Clarke.	[reply] [d/l] [select]
Re: More efficient? by Enlil (Parson) on Apr 30, 2003 at 21:28 UTC
Here is my take on your problem, though I might have misinterpreted some things. I am assuming your record is in the following order: might or might not have a comment at start might or might not have that row of mostly numbers first part of right sequence might or might not have a ! preceded by optional whitespace first part of left sequence a blank line, followed by more parts of the right and left sequence From this data you want populate a hash with the information. So that for each piece of data you have: `#$pkey is just a counter $c->{$pkey}{left_instance}{sequence}; $c->{$pkey}{right_instance}{sequence}; $c->{$pkey}{comments}; #i am guessing $c->{$pkey}{match} as well but don't know #what that does.` [download] I am also guessing that since you pass your hashref to `_load_stats` that you further add data from the "stats" line into your hash, so I added `$c->{$pkey}{stats}` as well. If nothing else it might give you a different approach from which to work from. Again keep in mind that I am just working from your one example of a record, some of my assumptions in my regular expressions will probably need some tweaking. Anyhow code follows: Read more... (4 kB) -enlil	[reply] [d/l] [select]
Re: More efficient? by tall_man (Parson) on Apr 30, 2003 at 20:24 UTC
If the "!" lines are getting in the way, why not substitute them out before you split the record? `my @bangs = $record =~ m/^[^\n]\![^\n]\n/mg; $record =~ s/^[^\n]\![^\n]\n//mg;` [download] Update: I have fixed the above pattern to avoid the evil ".*" (per Death to Dot Star!). Since the "!" lines have information, I gather the assumptions Enlil and BrowserUK made about whitespace are incorrect. I have added a capturing match to take care of keeping the information.	[reply] [d/l]
Re: Re: More efficient? by Anonymous Monk on Apr 30, 2003 at 23:00 UTC
Because although for my quick'n'dirty purposes, this would suffice, the program is intended to be a parser and as the ! lines do represent information, I should have the means in place to grab them.	[reply]