Parse::RecDescent help

glwtta has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I am having a bit of a problem coming up with a grammar to parse what looks like a very simple file. I can get it to work, but the resulting parser is excruciatingly slow: slow enough to be completely unusable, and slow enough to make me think it must be possible to do better.

So, without further ado, the grammar I am using is:

sequences   : header sequence(s)
header      : seq_count app_number
seq_count   : "<160> NUMBER OF SEQ ID NOS:" /\d+/
app_number  : "<140> CURRENT APPLICATION NUMBER:" /[\w\/,]+/
sequence    : seq_id seq_length seq_type organism feat_token(s?) seq
seq_id      : "<210> SEQ ID NO" /\d+/
seq_length  : "<211> LENGTH:" /\d+/
seq_type    : "<212> TYPE:" type
type        : "DNA" | "PRT"
organism    : "<213> ORGANISM:" /\w+ \w+/
feat_token  : feature | name_key | location | other
feature     : "<220> FEATURE:" /[\w\s]*/
name_key    : "<221> NAME/KEY:" /\w+/
location    : "<222> LOCATION:" /[\d\.\(\)]+/
other       : "<223> OTHER INFORMATION:" /[^<]+/
seq         : "<400> SEQUENCE:" /\d+/ /[\w\s]+/

And the actual data I am trying to get at:

<160> NUMBER OF SEQ ID NOS: 727
<140> CURRENT APPLICATION NUMBER: US/09/984,429
<210> SEQ ID NO 1
<211> LENGTH: 733
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 1
      gggatccgga gcccaaatct tctgacaaaa ctcacacatg cccaccgtgc ccagcacctg     60
      aattcgaggg tgcaccgtca gtcttcctct tccccccaaa acccaaggac accctcatga    120
      tctcccggac tcctgaggtc acatgcgtgg tggtggacgt aagccacgaa gaccctgagg    180
      tcaagttcaa ctggtacgtg gacggcgtgg aggtgcataa tgccaagaca aagccgcggg    240
      aggagcagta caacagcacg taccgtgtgg tcagcgtcct caccgtcctg caccaggact    300
      ggctgaatgg caaggagtac aagtgcaagg tctccaacaa agccctccca acccccatcg    360
      agaaaaccat ctccaaagcc aaagggcagc cccgagaacc acaggtgtac accctgcccc    420
      catcccggga tgagctgacc aagaaccagg tcagcctgac ctgcctggtc aaaggcttct    480
      atccaagcga catcgccgtg gagtgggaga gcaatgggca gccggagaac aactacaaga    540
      ccacgcctcc cgtgctggac tccgacggct ccttcttcct ctacagcaag ctcaccgtgg    600
      acaagagcag gtggcagcag gggaacgtct tctcatgctc cgtgatgcat gaggctctgc    660
      acaaccacta cacgcagaag agcctctccc tgtctccggg taaatgagtg cgacggccgc    720
      gactctagag gat                                                       733

<210> SEQ ID NO 2
<211> LENGTH: 5
<212> TYPE: PRT
<213> ORGANISM: Homo sapiens
<220> FEATURE: 
<221> NAME/KEY: Site
<222> LOCATION: (3)
<223> OTHER INFORMATION: Xaa equals any of the twenty naturally ocurring L-amino acids
<400> SEQUENCE: 2
      Trp Ser Xaa Trp Ser
        1               5

<210> SEQ ID NO 3
<211> LENGTH: 86
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE: 
<221> NAME/KEY: Primer_Bind
<223> OTHER INFORMATION: Synthetic sequence with 4 tandem copies of the GAS binding site
      found in the IRF1 promoter (Rothman et al., Immunity 1:457-468
      (1994)), 18 nucleotides complementary to the SV40 early promoter,
      and a Xho I restriction site.
<400> SEQUENCE: 3
      gcgcctcgag atttccccga aatctagatt tccccgaaat gatttccccg aaatgatttc     60
      cccgaaatat ctgccatctc aattag                                          86
<210> SEQ ID NO 86
<211> LENGTH: 194
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE: 
<223> OTHER INFORMATION: Amplimer
<400> SEQUENCE: 86
      tgcttggtga aggaatagcc accccagaga aggagtatgg acttctatac acaatcattc 60
      attcattcat tcattcattc attcattcat tcattcacta ctcatgcatg atctttgtcc 120
      ttatcttcct ccactgtcac atgaataccc acccactgca cctacctgct tcctattcct 180
      gagaacccag gctc                                                   194

<210> SEQ ID NO 87
<211> LENGTH: 23
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 87
      ggcaatggag gagttccggg aca                                         23

I suspect I am being overly greedy somewhere in the rules, and that this is what slows things down. Any suggestions on how this can be improved? (As you can probably tell, I am extremely new to P::RD.)

I don't need it to be blindingly fast, but currently each file (several megs) would take well over half an hour, and I have thousands of them!

Thanks in advance.



Re: Parse::RecDescent help
by kvale (Monsignor) on Mar 30, 2004 at 06:00 UTC

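use strict;
use warnings;

# Header values, the current sequence ID, and the per-sequence records.
my ($seq_num, $appl_num, $cur_seq, %seq);

while (<DATA>) {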
   if (/^<160> NUMBER OF SEQ ID NOS: (\d+)$/) {
      $seq_num = $1;
   }
   elsif (/^<140> CURRENT APPLICATION NUMBER: ([\w\/,]+)$/) {
      $appl_num = $1;
   }
   elsif (/^<210> SEQ ID NO (\d+)$/) {
      $cur_seq = $1;
      $seq{$cur_seq} = {};
   }
   elsif (/^<211> LENGTH: (\d+)$/) {
      $seq{$cur_seq}{length} = $1;
   }
   elsif (/^<212> TYPE: (DNA|PRT)$/) {
      $seq{$cur_seq}{type} = $1;
   }
   elsif (/^<213> ORGANISM: (\w+ \w+)$/) {
      $seq{$cur_seq}{organism} = $1;
   }
   elsif (/^<220> FEATURE:([\w\s]*)$/) {
      $seq{$cur_seq}{feature} = $1;
   }
   elsif (/^<221> NAME\/KEY: (\w+)$/) {
      $seq{$cur_seq}{name_key} = $1;
   }
   elsif (/^<222> LOCATION: ([\d\.\(\)]+)$/) {
      $seq{$cur_seq}{location} = $1;
   }
   elsif (/^<223> OTHER INFORMATION: ([^<]+)$/) {
      $seq{$cur_seq}{other} = $1;
   }
   elsif (/^<400> SEQUENCE: (\d+)$/) {
      $seq{$cur_seq}{seq_num} = $1;
   }
   elsif (/^([\w\s]+)$/) {
      $seq{$cur_seq}{seq} .= $1;
   }
   else {
      die "Unrecognized line: $_\n";
   }
}
foreach my $cur_seq (keys %seq) {
   print "Sequence: $cur_seq,   Organism: $seq{$cur_seq}{organism}\n";
   print "seq: $seq{$cur_seq}{seq}\n\n";
}

__DATA__
<160> NUMBER OF SEQ ID NOS: 727
<140> CURRENT APPLICATION NUMBER: US/09/984,429
<210> SEQ ID NO 1
<211> LENGTH: 733
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 1
      gggatccgga gcccaaatct tctgacaaaa ctcacacatg cccaccgtgc ccagcacctg     60
      aattcgaggg tgcaccgtca gtcttcctct tccccccaaa acccaaggac accctcatga    120
      tctcccggac tcctgaggtc acatgcgtgg tggtggacgt aagccacgaa gaccctgagg    180
      tcaagttcaa ctggtacgtg gacggcgtgg aggtgcataa tgccaagaca aagccgcggg    240
      aggagcagta caacagcacg taccgtgtgg tcagcgtcct caccgtcctg caccaggact    300
      ggctgaatgg caaggagtac aagtgcaagg tctccaacaa agccctccca acccccatcg    360
      agaaaaccat ctccaaagcc aaagggcagc cccgagaacc acaggtgtac accctgcccc    420
      catcccggga tgagctgacc aagaaccagg tcagcctgac ctgcctggtc aaaggcttct    480
      atccaagcga catcgccgtg gagtgggaga gcaatgggca gccggagaac aactacaaga    540
      ccacgcctcc cgtgctggac tccgacggct ccttcttcct ctacagcaag ctcaccgtgg    600
      acaagagcag gtggcagcag gggaacgtct tctcatgctc cgtgatgcat gaggctctgc    660
      acaaccacta cacgcagaag agcctctccc tgtctccggg taaatgagtg cgacggccgc    720
      gactctagag gat                                                       733

<210> SEQ ID NO 2
<211> LENGTH: 5
<212> TYPE: PRT
<213> ORGANISM: Homo sapiens
<220> FEATURE:
<221> NAME/KEY: Site
<222> LOCATION: (3)
<223> OTHER INFORMATION: Xaa equals any of the twenty naturally ocurring L-amino acids
<400> SEQUENCE: 2
      Trp Ser Xaa Trp Ser
        1               5

-Mark
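
One caveat about the line-by-line code above: run against SEQ ID NO 3 from the question, it would reach its die branch, because the wrapped OTHER INFORMATION lines contain characters such as "(" and ":" that [\w\s]+ does not match. A minimal sketch of one way around that is to remember the last tag seen and attribute untagged lines to it. The helper name parse_listing, the %FIELD table, and the hash keys are my own assumptions and not code from the thread; the sample data is SEQ ID NO 3 with its wrapped lines rejoined.

```perl
use strict;
use warnings;

# Field names keyed by numeric tag (the names are illustrative).
my %FIELD = (
    211 => 'length',   212 => 'type',     213 => 'organism',
    220 => 'feature',  221 => 'name_key', 222 => 'location',
    223 => 'other',    400 => 'seq_num',
);

# Hypothetical helper: returns a hash ref of records keyed by SEQ ID NO.
sub parse_listing {
    my ($fh) = @_;
    my (%seq, $cur, $tag);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^<210> SEQ ID NO (\d+)/) {
            # A new record starts here.
            ($cur, $tag) = ($1, 210);
            $seq{$cur} = { seq => '' };
        }
        elsif ($line =~ /^<(\d{3})> [^:]*:\s*(.*?)\s*$/) {
            # Any other tagged line: remember the tag so that the
            # untagged lines that follow can be attributed to it.
            $tag = $1;
            next unless defined $cur && exists $FIELD{$tag};   # skips <160>, <140>
            $seq{$cur}{ $FIELD{$tag} } = $2;
        }
        elsif (defined $cur && defined $tag && $line =~ /\S/) {
            if ($tag == 400) {
                # Sequence data: keep the all-letter tokens and drop
                # the position counters (all-digit tokens).
                my @residues = grep { /^[A-Za-z]+$/ } split ' ', $line;
                $seq{$cur}{seq} = join ' ', grep { length } $seq{$cur}{seq}, @residues;
            }
            elsif ($tag == 223) {
                # Wrapped free text: append it with a single space.
                $line =~ s/^\s+|\s+$//g;
                $seq{$cur}{other} .= " $line";
            }
            else {
                die "Unexpected continuation line: $line\n";
            }
        }
    }
    return \%seq;
}

# Demo on SEQ ID NO 3 from the question.
my $sample = <<'END';
<210> SEQ ID NO 3
<211> LENGTH: 86
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE:
<221> NAME/KEY: Primer_Bind
<223> OTHER INFORMATION: Synthetic sequence with 4 tandem copies of the GAS binding site
      found in the IRF1 promoter (Rothman et al., Immunity 1:457-468
      (1994)), 18 nucleotides complementary to the SV40 early promoter,
      and a Xho I restriction site.
<400> SEQUENCE: 3
      gcgcctcgag atttccccga aatctagatt tccccgaaat gatttccccg aaatgatttc     60
      cccgaaatat ctgccatctc aattag                                          86
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $seqs = parse_listing($fh);
close $fh;

for my $id (sort { $a <=> $b } keys %$seqs) {
    (my $residues = $seqs->{$id}{seq}) =~ tr/ //d;
    printf "%s: %s, declared %d, parsed %d\n",
        $id, $seqs->{$id}{organism}, $seqs->{$id}{length}, length $residues;
}
```

Comparing the parsed residue count with the declared LENGTH, as the loop at the end does, is a cheap sanity check when running over thousands of files. Protein records would need a different count, since each residue is a three-letter token there.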


Re: Re: Parse::RecDescent help

by PodMaster (Abbot) on Mar 30, 2004 at 07:34 UTC