beppu has asked for the wisdom of the Perl Monks concerning the following question:


NAME

Parse::RecDescent Woes - help a fellow perl hacker out


DESCRIPTION

I have a couple of Parse::RecDescent problems that I haven't been able to resolve for a while now. I'm posting this on perlmonks.org in the hope that someone out there may have a solution for me.

There is a file format developed by lineo.com called ECD that I have written a parser for. It is used to store meta-data for each of our packages in the Embedix Linux distro.

I decided to write a module that could both parse and generate ECD data, because I was working in packaging at the time, and it would have made my life a lot easier. It was my first time using Parse::RecDescent, and it was quite nice once I got the hang of it. I can parse over 90% of the ECD files we have right now, but there are two problems that prevent me from getting that last 10%.


Brain Damaged Pseudo-XML

ECD files look like pseudo-XML. The closest thing to their syntax that may be familiar to readers out there is the format used for configuring Apache. (apachestyle.vim *almost* works for ECD files.)

Because of this, I used demo_simpleXML.pl that comes with the Parse::RecDescent distribution as a starting point. Here is the grammar that comes with it:

    xml:  unitag
       |  tag content[$item[1]](s) endtag[$item[1]]
                                    { bless $item[2], $item[1]}

    unitag:
            m{<([a-zA-Z]+)/>}     { bless [], $1 }

    tag:
            m{<([a-zA-Z]+)>}      { $return = $1 }

    endtag:
            m{</$arg[0]>}
          | m{(\S+)} <error: was expecting </$arg[0]> but found $1 instead>

    content: <rulevar: local $error>
    content: rawtext <commit> check[$arg[0], $item[1]]
           | xml   <commit> check[$arg[0], $item[1]]
           | <error?: $error>  <error: error in <$arg[0]> block>

    rawtext: m{[^<]+}               { bless \$item[1], 'rawtext' }

    check: { my ($outertag, $innertag) = ($arg[0], ref $arg[1]);
             $return = $arg[1] if !$thisparser->{allow}{$outertag}
                               || $innertag =~ $thisparser->{allow}{$outertag};
             $error = ($innertag eq 'rawtext')
                    ? "Raw text not valid in <$outertag> block"
                    : "<$innertag> tag not valid in <$outertag> block";
             undef;
            }

The part I want to focus on is the definition of rawtext. It is defined to be a sequence of one or more characters that does not contain '<'. My problem is that I need to be able to accept '<' inside a rawtext section...

and I haven't been able to do it. I've tried all kinds of funky regexes to no avail. I even tried that look-ahead stuff for the first time, but I couldn't get it to cooperate.
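
A minimal illustration of the symptom, using a made-up string rather than real ECD data. It only exercises the rawtext pattern on its own, outside Parse::RecDescent, so treat it as a sketch of the problem and not of the parser:

```perl
use strict;
use warnings;

# Made-up sample: raw text that contains a literal '<'.
my $text = 'if (a < b) then';

# Same pattern as the rawtext rule: one or more characters that are not '<'.
my ($matched) = $text =~ m{^([^<]+)};

# The match stops right before the '<', so the rest is never seen as raw text.
my $leftover = substr($text, length $matched);

print "matched:  '$matched'\n";
print "leftover: '$leftover'\n";
```

The leftover begins with '<', which none of the tag rules can consume either, so the parse fails at that point.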

If __ANYONE__ out there can hack demo_simpleXML.pl to be able to accept '<' inside a rawtext section, and show me how to do it, I would be super-grateful. I'd even have to do something nice for you in return. This problem has had me stumped for over a month.


Speed on Large Files Depresses Me

I don't know if there is anything that can be done about the second problem. Most ECD files are fairly small, somewhere between 2KB and 8KB in size. These pose no problems, but occasionally there is an ECD file between 40KB and 80KB in size, and those take too long to parse.

For instance, parsing busybox.ecd, which is around 60KB, takes this long on my 600MHz VAIO:

    beppu@thestruggle$ time checkecd busybox.ecd 

    real    0m31.873s
    user    0m30.280s
    sys     0m0.190s

There is also the mother of all ECD files (linux.ecd) which is a little over 1MB in size. I pity Parse::RecDescent when I hand it the contents of linux.ecd in a scalar. For shits and giggles, I let it run on my other box at home which is a 500MHz PII. It sat there for 13 hours, bringing the CPU to full utilization, and using up between 60 and 80 MB of RAM before it died when it got caught by the previously mentioned '<'-in-rawtext bug.

What's even worse, the CTO here at lineo wrote a parser in Python a while back, and it's a lot faster. It takes a line-by-line approach, and doesn't keep a lot of data in memory while it parses. It's not like his code is very beautiful or his library has a nice API, but I can't deny that my perl module is a lot slower than his python library.
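
For comparison, a line-by-line approach might look roughly like the sketch below. It is not the CTO's Python code and it does not implement the real ECD grammar. The block syntax, the tag names, and the KEY=value lines are all assumptions made up for illustration, and parse_lines is a hypothetical helper. Unlike a truly low-memory parser, it still builds a tree, but it reads the input one line at a time and never backtracks:

```perl
use strict;
use warnings;

# Hypothetical, simplified block format (NOT the real ECD grammar):
#   <NAME optional-argument>   opens a block
#   </NAME>                    closes the innermost open block
#   KEY=value                  sets an attribute on the innermost open block
sub parse_lines {
    my ($fh) = @_;
    my $root   = { name => 'ROOT', attr => {}, children => [] };
    my @stack  = ($root);    # currently open blocks, innermost last
    my $lineno = 0;

    while (my $line = <$fh>) {
        $lineno++;
        chomp $line;
        next unless $line =~ /\S/;                  # skip blank lines

        if ($line =~ m{^\s*</(\w+)>\s*$}) {         # closing tag
            my $name = $1;
            die "line $lineno: unexpected </$name>\n"
                unless @stack > 1 && $stack[-1]{name} eq $name;
            pop @stack;
        }
        elsif ($line =~ m{^\s*<(\w+)(?:\s+([^>]*))?>\s*$}) {   # opening tag
            my $node = { name => $1, arg => $2, attr => {}, children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
        }
        elsif ($line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/) {          # KEY=value
            $stack[-1]{attr}{$1} = $2;
        }
        else {
            die "line $lineno: cannot parse '$line'\n";
        }
    }
    die "unclosed <" . $stack[-1]{name} . ">\n" if @stack > 1;
    return $root;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = <<'END';
<GROUP System>
  <COMPONENT busybox>
    TYPE=bool
    HELP=enable if size < 100
  </COMPONENT>
</GROUP>
END

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $tree = parse_lines($fh);
close $fh;

my $component = $tree->{children}[0]{children}[0];
print "$component->{name} $component->{arg}: HELP=$component->{attr}{HELP}\n";
```

Because each line is classified by its overall shape, a '<' inside a value is never mistaken for a tag in this sketch.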

What should I do?

use real XML

My current plan is to side-step the issue and convert the ECD files to XML, because XML::Parser is really fast. The XML version of busybox.ecd gets parsed in half a second instead of 30+ seconds. The hard part will be convincing everybody else to make the switch.

find another parsing module

Have any of you tried any of the other parsing modules available on CPAN? What have your experiences been?

anything else?

I'm open to suggestions.

Thanks for your time. If anyone can figure out how to allow '<' in a rawtext section, that'd be really cool.


SEE ALSO

Embedix::ECD

This module can be downloaded from CPAN.

    perl -MCPAN -e 'install "Embedix::ECD"'

You can also read more about it here:

    http://opensource.lineo.com/~beppu/perl/Embedix-ECD.html

Parse::RecDescent

    http://search.cpan.org/search?dist=Parse-RecDescent


AUTHOR

John Beppu <beppu@lineo.com>

Replies are listed 'Best First'.
Re: Parse::RecDescent Woes
by chipmunk (Parson) on Jan 04, 2001 at 20:22 UTC
    This answer assumes that you stick with ECD as the file format. With < allowed in raw text, the trick is determining whether a given < starts a tag or is just part of raw text. I think the rule you want is: m{<(?=(?:/[a-zA-Z]+|[a-zA-Z]+/?)>)} matches < if and only if it's the start of a tag, endtag, or unitag.

    In that case, I think a regex like this would work as the definition for rawtext: m{(?:[^<]|<(?!(?:/[a-zA-Z]+|[a-zA-Z]+/?)>))+} I did a few quick tests of this regex with demo_simpleXML.pl, and it worked as intended.
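
    A quick standalone check of that rawtext pattern, outside Parse::RecDescent. The input string and the tag name "note" are made up for illustration:

```perl
use strict;
use warnings;

# rawtext per the reply above: any non-'<' character, or a '<' that is
# NOT followed by something shaped like a tag, endtag, or unitag.
my $rawtext = qr{(?:[^<]|<(?!(?:/[a-zA-Z]+|[a-zA-Z]+/?)>))+};

# Made-up input: two literal '<' characters, then a real tag.
my $input = 'a < b and c <= d<note>more</note>';

# Capture the raw text at the start of the input.
my ($text) = $input =~ m{\A($rawtext)};

print "rawtext: '$text'\n";
```

    The match runs through both literal '<' characters and stops just before <note>. In the grammar, the pattern would presumably replace the one in the rawtext rule, with the action left unchanged.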

    (The redundant [a-zA-Z]+ could be eliminated using the (?(condition)...) regex feature, added in perl5.005: m{(?:[^<]|<(?!(/)?[a-zA-Z]+(?(1)|/?)>))+} If the (/) matches, then (?(1)|/?) will match the null string; if the (/) does not match, then (?(1)|/?) will match /?. So, / can be at the beginning or the end, but not both. )
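
    A small sketch that compares the two forms on a few made-up strings. leading_match is a hypothetical helper, not part of any module. It deliberately avoids wrapping the pattern in an extra capture group, since that would renumber the (/) group that (?(1)...) refers to:

```perl
use strict;
use warnings;

my $plain = qr{(?:[^<]|<(?!(?:/[a-zA-Z]+|[a-zA-Z]+/?)>))+};
my $cond  = qr{(?:[^<]|<(?!(/)?[a-zA-Z]+(?(1)|/?)>))+};

# Return the raw text matched at the start of $string, or undef.
# $+[0] is the end offset of the whole match, so no extra capture is needed.
sub leading_match {
    my ($re, $string) = @_;
    return $string =~ /\A$re/ ? substr($string, 0, $+[0]) : undef;
}

# Made-up samples: literal '<', a stray '</', a malformed '</a/>', a unitag.
for my $s ('a < b<tag>', 'x </ y</end>', '</a/>', '<br/>') {
    my $p = leading_match($plain, $s);
    my $c = leading_match($cond,  $s);
    printf "%-14s plain=%-10s cond=%s\n", $s,
        defined $p ? "'$p'" : 'none',
        defined $c ? "'$c'" : 'none';
}
```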
      chipmunk, you are a master. Thanks so much for that regular expression.

      I predict that you will defeat Ovid in the Iron Perl Monks battle. Nothing against Ovid, of course. I'm sure he's a nice guy.

Re: Parse::RecDescent Woes
by mirod (Canon) on Jan 04, 2001 at 19:49 UTC

    The Real XML Way (tm) to solve the < problem is to replace it with &lt;, although this implies that every piece of software that uses the file either knows how to replace &lt; with < or (even better) uses an XML parser to filter the input.
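
    For software that writes such files by hand, the replacement can be sketched in plain Perl as below. xml_escape and xml_unescape are hypothetical helper names, and they cover only the three characters handled here; a real XML parser does the unescaping for you:

```perl
use strict;
use warnings;

# Escape the characters that are special in XML character data.
# '&' must go first so the entities added afterwards are not re-escaped.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Reverse of xml_escape; '&amp;' goes last for the same reason.
sub xml_unescape {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&amp;/&/g;
    return $text;
}

my $raw     = 'if (a < b && b > c)';
my $escaped = xml_escape($raw);

print "$escaped\n";
print xml_unescape($escaped), "\n";
```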

    As I mentioned in On XML Parsing, any approach not based on an XML parser (including your boss' Python hack) does not usually process the full range of XML features, so if you choose to move to XML as a format, do yourself a favor and use one.

    If you choose to use XML, you can then use CDATA sections to embed < in elements:

    <elt><![CDATA[it is now safe to use <, > & and the likes here]]></elt>
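
    One catch when generating CDATA sections is that the text itself must not contain ]]>, since that sequence ends the section. A minimal sketch of a wrapper that splits such text across two sections (cdata is a hypothetical helper name):

```perl
use strict;
use warnings;

# Wrap text in a CDATA section. A literal "]]>" inside the text would end
# the section early, so it is split: "]]" closes one section and ">" opens
# the next one.
sub cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print '<elt>', cdata('a < b && c > d'), "</elt>\n";
print '<elt>', cdata('x]]>y'), "</elt>\n";
```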

    As for XML modules, look at Module Reviews for a bunch of reviews and at my site http://www.xmltwig.com/article/ for reviews and benchmarks.

Re: Parse::RecDescent Woes
by little (Curate) on Jan 04, 2001 at 14:39 UTC
    Great work !!
    ehm, yeah, better use XML and find a way for a strict DTD to avoid minor bugs arising from spelling errors and create an appropriate XSLT to convert from XML into what you need in your ecd file, so you parse xml but deliver *.ecd files.
    Mhm, I guess that's a point for perl in processing XML, eh?

    Have a nice day
    All decision is left to your taste

      I have to admit that I'm an XML-newbie. I personally want to move to XML, because the ECD format already looks kinda-sorta like XML. Ideally, the ECD format would disappear, and there would be an XECD format that would have a nice DTD and would contain all the same information the normal ECD format does right now..... and all our parsing troubles would go away.

      The decision is not really up to me, though -- I'm just a grunt. ;-)