beppu has asked for the wisdom of the Perl Monks concerning the following question:
Parse::RecDescent Woes - help a fellow perl hacker out
I have a couple of Parse::RecDescent problems that I haven't been able to resolve for a while now. I'm posting this on perlmonks.org in the hope that someone out there may have a solution for me.
There is a file format developed by lineo.com called ECD that I have written a parser for. It is used to store meta-data for each of our packages in the Embedix Linux distro.
I decided to write a module that could both parse and generate ECD data, because I was working in packaging at the time, and it would have made my life a lot easier. It was my first time using Parse::RecDescent, and it was quite nice once I got the hang of it. I can parse over 90% of the ECD files we have right now, but there are two problems that prevent me from getting that last 10%.
ECD files look like pseudo-XML. The closest thing to their syntax that may be familiar to the readers out there is the format used for configuring Apache. (apachestyle.vim *almost* works for ECD files.)
Because of this, I used demo_simpleXML.pl that comes with the Parse::RecDescent distribution as a starting point. Here is the grammar that comes with it:
    xml: unitag
       | tag content[$item[1]](s) endtag[$item[1]]
             { bless $item[2], $item[1] }

    unitag:
         m{<([a-zA-Z]+)/>}
             { bless [], $1 }

    tag:
         m{<([a-zA-Z]+)>}
             { $return = $1 }

    endtag:
         m{</$arg[0]>}
       | m{(\S+)} <error: was expecting </$arg[0]> but found $1 instead>

    content: <rulevar: local $error>
    content: rawtext <commit> check[$arg[0], $item[1]]
           | xml     <commit> check[$arg[0], $item[1]]
           | <error?: $error> <error: error in <$arg[0]> block>

    rawtext: m{[^<]+}
             { bless \$item[1], 'rawtext' }

    check: { my ($outertag, $innertag) = ($arg[0], ref $arg[1]);
             $return = $arg[1] if !$thisparser->{allow}{$outertag}
                               || $innertag =~ $thisparser->{allow}{$outertag};
             $error = ($innertag eq 'rawtext')
                    ? "Raw text not valid in <$outertag> block"
                    : "<$innertag> tag not valid in <$outertag> block";
             undef;
           }
The part I want to focus on is the definition of rawtext. It is defined to be a sequence of one or more characters that does not contain '<'. My problem is that I need to be able to accept '<' inside a rawtext section, and I haven't been able to do it. I've tried all kinds of funky regexes to no avail. I even tried that look-ahead stuff for the first time, but I couldn't get it to cooperate.
If __ANYONE__ out there can hack demo_simpleXML.pl to be able to accept '<' inside a rawtext section, and show me how to do it, I would be super-grateful. I'd even have to do something nice for you in return. This problem has had me stumped for over a month.
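For anyone who wants to poke at the look-ahead idea outside of the grammar, here is a minimal sketch in plain Perl. It rests on an assumption that may not hold for real ECD data: a '<' only starts a tag when it is immediately followed by a letter or a '/', and any other '<' is literal text. The names $rawtext and leading_rawtext are made up for this example and are not part of demo_simpleXML.pl or Embedix::ECD.

```perl
use strict;
use warnings;

# ASSUMPTION (not from the ECD spec): '<' begins a tag only when the next
# character is a letter or '/'. Any other '<' is treated as raw text.
my $rawtext = qr{(?:[^<]+|<(?![a-zA-Z/]))+};

# Hypothetical helper: return the raw text at the start of a string,
# or undef if the string starts with something that looks like a tag.
sub leading_rawtext {
    my ($input) = @_;
    return $input =~ m{\A($rawtext)} ? $1 : undef;
}

print leading_rawtext('if a < b then 1 << 2</OPTION>'), "\n";
# prints: if a < b then 1 << 2
```

If the assumption fits, the same pattern could presumably be dropped into the grammar as the body of the rawtext rule in place of the original regex. Note the limitation: text such as "a <b" (a '<' directly followed by a letter) would still be read as the start of a tag.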
I don't know if there is anything that can be done about the second problem. Most ECD files are fairly small, somewhere between 2KB and 8KB in size. These pose no problems, but occasionally there is an ECD file that ranges between 40KB and 80KB, and these take too long to parse.
For instance, parsing busybox.ecd which is around 60KB takes this long on my 600MHz VAIO:
beppu@thestruggle$ time checkecd busybox.ecd
real 0m31.873s
user 0m30.280s
sys 0m0.190s
There is also the mother of all ECD files (linux.ecd) which is a little over 1MB in size. I pity Parse::RecDescent when I hand it the contents of linux.ecd in a scalar. For shits and giggles, I let it run on my other box at home which is a 500MHz PII. It sat there for 13 hours, bringing the CPU to full utilization, and using up between 60 and 80 MB of RAM before it died when it got caught by the previously mentioned '<'-in-rawtext bug.
What's even worse, the CTO here at lineo wrote a parser in Python a while back, and it's a lot faster. It takes a line-by-line approach, and doesn't keep a lot of data in memory while it parses. It's not like his code is very beautiful or his library has a nice API, but I can't deny that my perl module is a lot slower than his python library.
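To make the line-by-line idea concrete, here is a rough sketch of what such a parser could look like in Perl. It is not the CTO's Python code and not what Embedix::ECD does. The input format is an assumption for illustration only: one construct per line, where `<TAG name>` opens a block, `</TAG>` closes it, and `KEY=value` sets an attribute on the innermost open block. The function name parse_lines and the node layout are hypothetical.

```perl
use strict;
use warnings;

# ASSUMED format, for illustration only (real ECD files may differ):
#   <TAG name>   opens a block
#   </TAG>       closes the innermost block
#   KEY=value    sets an attribute on the innermost open block
sub parse_lines {
    my ($fh) = @_;
    my $root  = { tag => 'ROOT', name => '', attr => {}, kids => [] };
    my @stack = ($root);    # currently open blocks, innermost last

    while (my $line = <$fh>) {
        next if $line =~ /^\s*(?:#.*)?$/;    # skip blank lines and comments

        if ($line =~ m{^\s*</(\w+)>\s*$}) {
            # Closing tag: it must match the innermost open block.
            my $open = pop @stack;
            die "line $.: unexpected </$1>\n"
                if !@stack || $open->{tag} ne $1;
        }
        elsif ($line =~ m{^\s*<(\w+)(?:\s+([^>]*?))?\s*>\s*$}) {
            # Opening tag: attach a new node to its parent and descend.
            my $node = { tag => $1, name => $2 // '', attr => {}, kids => [] };
            push @{ $stack[-1]{kids} }, $node;
            push @stack, $node;
        }
        elsif ($line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/) {
            # Attribute: the value may contain '<' without confusing anything.
            $stack[-1]{attr}{$1} = $2;
        }
        else {
            die "line $.: cannot parse: $line";
        }
    }
    die "unclosed <$stack[-1]{tag}>\n" if @stack > 1;
    return $root;
}

my $ecd = <<'END';
<GROUP System>
  <OPTION busybox>
    TYPE=bool
    PROMPT=size < 10
  </OPTION>
</GROUP>
END

open my $fh, '<', \$ecd or die "cannot open in-memory file: $!";
my $tree = parse_lines($fh);
print $tree->{kids}[0]{kids}[0]{attr}{PROMPT}, "\n";
# prints: size < 10
```

Each line is classified with one anchored regex and only a stack of open blocks is kept while reading, so the work grows roughly linearly with the number of lines. The sketch does not handle values that span several lines, which a real ECD parser would likely need.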
What should I do?
My current plan is to side-step the issue and convert the ECD files to XML, because XML::Parser is really fast. The XML version of busybox.ecd gets parsed in half a second instead of 30+ seconds. The hard part will be convincing everybody else to make the switch.
Have any of you tried any of the other parsing modules available on CPAN? What have your experiences been?
I'm open to suggestions.
Thanks for your time. If anyone can figure out how to allow '<' in a rawtext section, that'd be really cool.
This module can be downloaded from CPAN.
perl -MCPAN -e 'install "Embedix::ECD"'
You can also read more about it here:
http://opensource.lineo.com/~beppu/perl/Embedix-ECD.html
http://search.cpan.org/search?dist=Parse-RecDescent
John Beppu <beppu@lineo.com>
Re: Parse::RecDescent Woes
by chipmunk (Parson) on Jan 04, 2001 at 20:22 UTC
    by beppu (Hermit) on Jan 05, 2001 at 06:58 UTC

Re: Parse::RecDescent Woes
by mirod (Canon) on Jan 04, 2001 at 19:49 UTC

Re: Parse::RecDescent Woes
by little (Curate) on Jan 04, 2001 at 14:39 UTC
    by beppu (Hermit) on Jan 04, 2001 at 15:17 UTC