
Greetings Monks,

I am having a bit of a problem coming up with a grammar to parse what looks like a very simple file. I can get it to work, but the resulting parser is excruciatingly slow: slow enough to be completely unusable, and slow enough to make me think it must be possible to do better.

So, without further ado, the grammar I am using is:

sequences   : header sequence(s)
header      : seq_count app_number
seq_count   : "<160> NUMBER OF SEQ ID NOS:" /\d+/
app_number  : "<140> CURRENT APPLICATION NUMBER:" /[\w\/,]+/
sequence    : seq_id seq_length seq_type organism feat_token(s?) seq
seq_id      : "<210> SEQ ID NO" /\d+/
seq_length  : "<211> LENGTH:" /\d+/
seq_type    : "<212> TYPE:" type
type        : "DNA" | "PRT"
organism    : "<213> ORGANISM:" /\w+ \w+/
feat_token  : feature | name_key | location | other
feature     : "<220> FEATURE:" /[\w\s]*/
name_key    : "<221> NAME/KEY:" /\w+/
location    : "<222> LOCATION:" /[\d\.\(\)]+/
other       : "<223> OTHER INFORMATION:" /[^<]+/
seq         : "<400> SEQUENCE:" /\d+/ /[\w\s]+/

And the actual data I am trying to get at:

<160> NUMBER OF SEQ ID NOS: 727
<140> CURRENT APPLICATION NUMBER: US/09/984,429
<210> SEQ ID NO 1
<211> LENGTH: 733
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 1
      gggatccgga gcccaaatct tctgacaaaa ctcacacatg cccaccgtgc ccagcacctg     60
      aattcgaggg tgcaccgtca gtcttcctct tccccccaaa acccaaggac accctcatga    120
      tctcccggac tcctgaggtc acatgcgtgg tggtggacgt aagccacgaa gaccctgagg    180
      tcaagttcaa ctggtacgtg gacggcgtgg aggtgcataa tgccaagaca aagccgcggg    240
      aggagcagta caacagcacg taccgtgtgg tcagcgtcct caccgtcctg caccaggact    300
      ggctgaatgg caaggagtac aagtgcaagg tctccaacaa agccctccca acccccatcg    360
      agaaaaccat ctccaaagcc aaagggcagc cccgagaacc acaggtgtac accctgcccc    420
      catcccggga tgagctgacc aagaaccagg tcagcctgac ctgcctggtc aaaggcttct    480
      atccaagcga catcgccgtg gagtgggaga gcaatgggca gccggagaac aactacaaga    540
      ccacgcctcc cgtgctggac tccgacggct ccttcttcct ctacagcaag ctcaccgtgg    600
      acaagagcag gtggcagcag gggaacgtct tctcatgctc cgtgatgcat gaggctctgc    660
      acaaccacta cacgcagaag agcctctccc tgtctccggg taaatgagtg cgacggccgc    720
      gactctagag gat                                                       733

<210> SEQ ID NO 2
<211> LENGTH: 5
<212> TYPE: PRT
<213> ORGANISM: Homo sapiens
<220> FEATURE: 
<221> NAME/KEY: Site
<222> LOCATION: (3)
<223> OTHER INFORMATION: Xaa equals any of the twenty naturally ocurring L-amino acids
<400> SEQUENCE: 2
      Trp Ser Xaa Trp Ser
        1               5

<210> SEQ ID NO 3
<211> LENGTH: 86
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE: 
<221> NAME/KEY: Primer_Bind
<223> OTHER INFORMATION: Synthetic sequence with 4 tandem copies of the GAS binding site
      found in the IRF1 promoter (Rothman et al., Immunity 1:457-468
      (1994)), 18 nucleotides complementary to the SV40 early promoter,
      and a Xho I restriction site.
<400> SEQUENCE: 3
      gcgcctcgag atttccccga aatctagatt tccccgaaat gatttccccg aaatgatttc     60
      cccgaaatat ctgccatctc aattag                                          86
<210> SEQ ID NO 86
<211> LENGTH: 194
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE: 
<223> OTHER INFORMATION: Amplimer
<400> SEQUENCE: 86
      tgcttggtga aggaatagcc accccagaga aggagtatgg acttctatac acaatcattc 60
      attcattcat tcattcattc attcattcat tcattcacta ctcatgcatg atctttgtcc 120
      ttatcttcct ccactgtcac atgaataccc acccactgca cctacctgct tcctattcct 180
      gagaacccag gctc                                                   194

<210> SEQ ID NO 87
<211> LENGTH: 23
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 87
      ggcaatggag gagttccggg aca                                         23
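For comparison, the layout of the data (one `<NNN>` tag per field, each sequence record opening with `<210>`) can also be read line by line with plain regexes, with no grammar involved. The following is only a minimal sketch of that swapped-in technique, not Parse::RecDescent. `parse_listing` is a hypothetical helper name, and the code assumes that wrapped display lines have already been joined back into single lines, that a repeated tag inside one record may overwrite the earlier value, and that only the bare residues of `<400>` are wanted.

```perl
use strict;
use warnings;

# parse_listing is a hypothetical helper, not part of Parse::RecDescent.
# It returns (\%header, \@records); each record maps a tag such as "211"
# to its value. Assumption: a repeated tag in one record overwrites the
# earlier value.
sub parse_listing {
    my ($text) = @_;
    my (%header, @records);
    my $target = \%header;    # hash currently being filled
    my $tag;                  # most recent tag, used by continuation lines

    for my $line (split /\n/, $text) {
        if ($line =~ /^<(\d+)>\s*(.*)$/) {
            $tag = $1;
            my $rest = $2;

            # Assumption: every sequence record starts at its <210> line.
            if ($tag eq '210') {
                push @records, {};
                $target = $records[-1];
            }

            # The value follows the first colon; "<210> SEQ ID NO 1" has
            # no colon, so fall back to the trailing number.
            my $value = '';
            if    ($rest =~ /:\s*(.*)$/) { $value = $1 }
            elsif ($rest =~ /(\d+)\s*$/) { $value = $1 }
            $value =~ s/\s+$//;
            $target->{$tag} = $value;
        }
        elsif (defined $tag && $line =~ /\S/) {
            # Untagged, non-blank line: continuation of the previous field.
            $target->{$tag} .= " $line";
        }
    }

    # Keep only residues in <400>: strip the echoed sequence id, the
    # position counters and all whitespace.
    for my $rec (@records) {
        next unless defined $rec->{400};
        $rec->{400} =~ s/[\d\s]+//g;
    }
    return (\%header, \@records);
}

my $sample = <<'END';
<160> NUMBER OF SEQ ID NOS: 727
<140> CURRENT APPLICATION NUMBER: US/09/984,429
<210> SEQ ID NO 87
<211> LENGTH: 23
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens
<400> SEQUENCE: 87
      ggcaatggag gagttccggg aca   23
END

my ($header, $records) = parse_listing($sample);
printf "%d record(s), first has %d residues\n",
    scalar @$records, length $records->[0]{400};
```

Because it makes a single pass over the lines and never backtracks, a loop like this may serve as a speed baseline when judging how much of the half hour is spent inside the grammar itself.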

I suspect I am being overly greedy somewhere in the rules, which slows things down. Any suggestions on how this can be improved? (As you can probably tell, I am extremely new to Parse::RecDescent.)

I don't need it to be blindingly fast, but currently each file (several megabytes) would take well over half an hour, and I have thousands of them!

Thanks in advance.

In reply to Parse::RecDescent help by glwtta
