glwtta has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I am having a bit of a problem coming up with a grammar to parse what looks like a very simple file. I can get it to work, but the resulting parser is excruciatingly slow - enough so to be completely unusable, and to make me think that it's possible to do better.

So, without further ado, the grammar I am using is:

sequences : header sequence(s) header : seq_count app_number seq_count : "<160> NUMBER OF SEQ ID NOS:" /\d+/ app_number : "<140> CURRENT APPLICATION NUMBER:" /[\w\/,]+/ sequence : seq_id seq_length seq_type organism feat_token(s?) seq seq_id : "<210> SEQ ID NO" /\d+/ seq_length : "<211> LENGTH:" /\d+/ seq_type : "<212> TYPE:" type type : "DNA" | "PRT" organism : "<213> ORGANISM:" /\w+ \w+/ feat_token : feature | name_key | location | other feature : "<220> FEATURE:" /[\w\s]*/ name_key : "<221> NAME/KEY:" /\w+/ location : "<222> LOCATION:" /[\d\.\(\)]+/ other : "<223> OTHER INFORMATION:" /[^<]+/ seq : "<400> SEQUENCE:" /\d+/ /[\w\s]+/
And the actual data I am trying to get at:
<160> NUMBER OF SEQ ID NOS: 727 <140> CURRENT APPLICATION NUMBER: US/09/984,429 <210> SEQ ID NO 1 <211> LENGTH: 733 <212> TYPE: DNA <213> ORGANISM: Homo sapiens <400> SEQUENCE: 1 gggatccgga gcccaaatct tctgacaaaa ctcacacatg cccaccgtgc ccagcacct +g 60 aattcgaggg tgcaccgtca gtcttcctct tccccccaaa acccaaggac accctcatg +a 120 tctcccggac tcctgaggtc acatgcgtgg tggtggacgt aagccacgaa gaccctgag +g 180 tcaagttcaa ctggtacgtg gacggcgtgg aggtgcataa tgccaagaca aagccgcgg +g 240 aggagcagta caacagcacg taccgtgtgg tcagcgtcct caccgtcctg caccaggac +t 300 ggctgaatgg caaggagtac aagtgcaagg tctccaacaa agccctccca acccccatc +g 360 agaaaaccat ctccaaagcc aaagggcagc cccgagaacc acaggtgtac accctgccc +c 420 catcccggga tgagctgacc aagaaccagg tcagcctgac ctgcctggtc aaaggcttc +t 480 atccaagcga catcgccgtg gagtgggaga gcaatgggca gccggagaac aactacaag +a 540 ccacgcctcc cgtgctggac tccgacggct ccttcttcct ctacagcaag ctcaccgtg +g 600 acaagagcag gtggcagcag gggaacgtct tctcatgctc cgtgatgcat gaggctctg +c 660 acaaccacta cacgcagaag agcctctccc tgtctccggg taaatgagtg cgacggccg +c 720 gactctagag gat + 733 <210> SEQ ID NO 2 <211> LENGTH: 5 <212> TYPE: PRT <213> ORGANISM: Homo sapiens <220> FEATURE: <221> NAME/KEY: Site <222> LOCATION: (3) <223> OTHER INFORMATION: Xaa equals any of the twenty naturally ocurri +ng L-amino acids <400> SEQUENCE: 2 Trp Ser Xaa Trp Ser 1 5 <210> SEQ ID NO 3 <211> LENGTH: 86 <212> TYPE: DNA <213> ORGANISM: Artificial Sequence <220> FEATURE: <221> NAME/KEY: Primer_Bind <223> OTHER INFORMATION: Synthetic sequence with 4 tandem copies of th +e GAS binding site found in the IRF1 promoter (Rothman et al., Immunity 1:457-468 (1994)), 18 nucleotides complementary to the SV40 early promoter +, and a Xho I restriction site. <400> SEQUENCE: 3 gcgcctcgag atttccccga aatctagatt tccccgaaat gatttccccg aaatgattt +c 60 cccgaaatat ctgccatctc aattag + 86 <210> SEQ ID NO 86 <211> LENGTH: 194 <212> TYPE: DNA <213> ORGANISM: Artificial Sequence <220> FEATURE: <223> OTHER INFORMATION: Amplimer <400> SEQUENCE: 86 tgcttggtga aggaatagcc accccagaga aggagtatgg acttctatac acaatcatt +c 60 attcattcat tcattcattc attcattcat tcattcacta ctcatgcatg atctttgtc +c 120 ttatcttcct ccactgtcac atgaataccc acccactgca cctacctgct tcctattcc +t 180 gagaacccag gctc + 194 <210> SEQ ID NO 87 <211> LENGTH: 23 <212> TYPE: DNA <213> ORGANISM: Homo sapiens <400> SEQUENCE: 87 ggcaatggag gagttccggg aca + 23

I suspect I am being overly greedy somewhere in the rules, slowing things down - any suggestions on how this can be improved? (As you can probably tell, I am extremely new to P::RD)

I don't need it to be blindingly fast, but currently each file (several megs) would take well over half an hour, and I have thousands of them!

thanks in advance

Replies are listed 'Best First'.
Re: Parse::RecDescent help
by kvale (Monsignor) on Mar 30, 2004 at 06:00 UTC
    I don't see any obvious problems with your grammar. Indeed, it is very simple, no recursion and little ambiguity. Sometimes P::RD is slower than one would like. But parsing (tens of) gigabytes is a big job in any case. If P::RD is too slow, you might try to roll your own parser:
    while (<DATA>) { if (/^<160> NUMBER OF SEQ ID NOS: (\d+)$/) { $seq_num = $1; } elsif (/^<140> CURRENT APPLICATION NUMBER: ([\w\/,]+)$/) { $appl_num = $1; } elsif (/^<210> SEQ ID NO (\d+)$/) { $cur_seq = $1; $seq{$cur_seq} = {}; } elsif (/^<211> LENGTH: (\d+)$/) { $seq{$cur_seq}{length} = $1; } elsif (/^<212> TYPE: (DNA|PRT)$/) { $seq{$cur_seq}{type} = $1; } elsif (/^<213> ORGANISM: (\w+ \w+)$/) { $seq{$cur_seq}{organism} = $1; } elsif (/^<220> FEATURE:([\w\s]*)$/) { $seq{$cur_seq}{feature} = $1; } elsif (/^<221> NAME\/KEY: (\w+)$/) { $seq{$cur_seq}{name_key} = $1; } elsif (/^<222> LOCATION: ([\d\.\(\)]+)$/) { $seq{$cur_seq}{location} = $1; } elsif (/^<223> OTHER INFORMATION: ([^<]+)$/) { $seq{$cur_seq}{other} = $1; } elsif (/^<400> SEQUENCE: (\d+)$/) { $seq{$cur_seq}{seq_num} = $1; } elsif (/^([\w\s]+)$/) { $seq{$cur_seq}{seq} .= $1; } else { die "Unrecognized line: $_\n"; } } foreach my $cur_seq (keys %seq) { print "Sequence: $cur_seq, Organism: $seq{$cur_seq}{organism}\n"; print "seq: $seq{$cur_seq}{seq}\n\n"; } __DATA__ <160> NUMBER OF SEQ ID NOS: 727 <140> CURRENT APPLICATION NUMBER: US/09/984,429 <210> SEQ ID NO 1 <211> LENGTH: 733 <212> TYPE: DNA <213> ORGANISM: Homo sapiens <400> SEQUENCE: 1 gggatccgga gcccaaatct tctgacaaaa ctcacacatg cccaccgtgc ccagcacct +g 60 aattcgaggg tgcaccgtca gtcttcctct tccccccaaa acccaaggac accctcatg +a 120 tctcccggac tcctgaggtc acatgcgtgg tggtggacgt aagccacgaa gaccctgag +g 180 tcaagttcaa ctggtacgtg gacggcgtgg aggtgcataa tgccaagaca aagccgcgg +g 240 aggagcagta caacagcacg taccgtgtgg tcagcgtcct caccgtcctg caccaggac +t 300 ggctgaatgg caaggagtac aagtgcaagg tctccaacaa agccctccca acccccatc +g 360 agaaaaccat ctccaaagcc aaagggcagc cccgagaacc acaggtgtac accctgccc +c 420 catcccggga tgagctgacc aagaaccagg tcagcctgac ctgcctggtc aaaggcttc +t 480 atccaagcga catcgccgtg gagtgggaga gcaatgggca gccggagaac aactacaag +a 540 ccacgcctcc cgtgctggac tccgacggct ccttcttcct ctacagcaag ctcaccgtg +g 600 acaagagcag gtggcagcag gggaacgtct tctcatgctc cgtgatgcat gaggctctg +c 660 acaaccacta cacgcagaag agcctctccc tgtctccggg taaatgagtg cgacggccg +c 720 gactctagag gat + 733 <210> SEQ ID NO 2 <211> LENGTH: 5 <212> TYPE: PRT <213> ORGANISM: Homo sapiens <220> FEATURE: <221> NAME/KEY: Site <222> LOCATION: (3) <223> OTHER INFORMATION: Xaa equals any of the twenty naturally ocurri +ng L-amino acids <400> SEQUENCE: 2 Trp Ser Xaa Trp Ser 1 5

    -Mark

      I think that's an excellent idea, especially given the general consistency of the format (only 210 is inconsistent in that it doesn't use a :, and then 400 is redundant in that it lists the seq id first). Here's my whack:

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Parse::RecDescent help
by gjb (Vicar) on Mar 30, 2004 at 07:38 UTC

    Unless I miss something obvious, the language you're parsing is regular. Parse::RecDescent is a tool to deal with context free languages. Since every regular language is context free, you can use Parse::RecDescent to deal with it, but it's overkill.

    Parsing a context free language is a relatively costly process compared to parsing a regular language (O(n^3) vs. O(n)) and since you're parsing lots of data, that is quite a difference. In this case it would almost certainly be much faster to implement your own parser as was suggested above.

    Just my 2 cents, -gjb-

Re: Parse::RecDescent help
by allolex (Curate) on Mar 30, 2004 at 09:27 UTC
bioperl instead of Parse::RecDescent?
by PodMaster (Abbot) on Mar 30, 2004 at 05:40 UTC
    Ignoring your problem for a second, that looks like some gene sequence stuff ie bioperl, is it? If it is, chances are there already exist tools for dealing with those files (maybe bioperl-db).

    Now, as to your grammar, did you really mean to hardcode the linenumber in seq_count    : "<160> NUMBER OF SEQ ID NOS:" /\d+/? Are you sure you didn't want seq_count    : "<" /\d+/ "> NUMBER OF SEQ ID NOS:" /\d+/?
    update: Take a look at OPTIMIZING YOUR GRAMMARS in Parse::RecDescent::FAQ.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      I'm pretty sure that this format is something the PSIPS people just made up and that BP doesn't have anything for it (not that their stuff is exactly speedy, either).
Re: Parse::RecDescent help
by glwtta (Hermit) on Mar 30, 2004 at 15:55 UTC
    Sorry, I should've mentioned that I don't really have any problem writing the parser for this myself (and I do realize that P::RD is way overkill for the problem at hand) - it just looked like a simple enough problem to get some practice with P::RD and I wanted to confirm that it is that slow.

    Thanks for all the help though.