NAME
DESCRIPTION
- Brain Damaged Pseudo-XML
- Speed on Large Files Depresses Me
SEE ALSO
AUTHOR

NAME

Parse::RecDescent Woes - help a fellow perl hacker out

DESCRIPTION

I have a couple of Parse::RecDescent problems that I haven't been able to resolve for a while now. I'm posting this on perlmonks.org in the hope that someone out there may have a solution for me.

There is a file format developed by lineo.com called ECD that I have written a parser for. It is used to store meta-data for each of our packages in the Embedix Linux distro.

I decided to write a module that could both parse and generate ECD data, because I was working in packaging at the time, and it would have made my life a lot easier. It was my first time using Parse::RecDescent, and it was quite nice once I got the hang of it. I can parse over 90% of the ECD files we have right now, but there are two problems that prevent me from getting that last 10%.

Brain Damaged Pseudo-XML

ECD files look like pseudo-XML. The closest familiar syntax is probably the format used for configuring Apache. (apachestyle.vim *almost* works for ECD files.)

Because of this, I used demo_simpleXML.pl that comes with the Parse::RecDescent distribution as a starting point. Here is the grammar that comes with it:

    xml:  unitag
       |  tag content[$item[1]](s) endtag[$item[1]]
                                    { bless $item[2], $item[1] }

    unitag:
            m{<([a-zA-Z]+)/>}     { bless [], $1 }

    tag:
            m{<([a-zA-Z]+)>}      { $return = $1 }

    endtag:
            m{</$arg[0]>}
          | m{(\S+)} <error: was expecting </$arg[0]> but found $1 instead>

    content: <rulevar: local $error>
    content: rawtext <commit> check[$arg[0], $item[1]]
           | xml   <commit> check[$arg[0], $item[1]]
           | <error?: $error>  <error: error in <$arg[0]> block>

    rawtext: m{[^<]+}               { bless \$item[1], 'rawtext' }

    check: { my ($outertag, $innertag) = ($arg[0], ref $arg[1]);
             $return = $arg[1] if !$thisparser->{allow}{$outertag}
                               || $innertag =~ $thisparser->{allow}{$outertag};
             $error = ($innertag eq 'rawtext')
                    ? "Raw text not valid in <$outertag> block"
                    : "<$innertag> tag not valid in <$outertag> block";
             undef;
            }

The part I want to focus on is the definition of rawtext. It is defined to be a sequence of one or more characters that does not contain '<'. My problem is that I need to be able to accept '<' inside a rawtext section...

and I haven't been able to do it. I've tried all kinds of funky regexes to no avail. I even tried that look-ahead stuff for the first time, but I couldn't get it to cooperate.

If __ANYONE__ out there can hack demo_simpleXML.pl to be able to accept '<' inside a rawtext section, and show me how to do it, I would be super-grateful. I'd even have to do something nice for you in return. This problem has had me stumped for over a month.
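For anyone who wants to experiment, the look-ahead idea can be tried in plain Perl outside the grammar. This is a minimal sketch, not a confirmed fix: the pattern and the helper name `leading_rawtext` are hypothetical, and it assumes that a '<' only starts a tag when it is followed by an optional '/', a run of letters, an optional '/', and a '>'. Real ECD tags may need a looser test. It has not been tried inside Parse::RecDescent or against real ECD files.

```perl
use strict;
use warnings;

# Hypothetical rawtext pattern: take any character that is not '<',
# or a '<' that is NOT the start of something shaped like a tag
# (<name>, </name>, or <name/>).
my $rawtext = qr{ (?: [^<] | < (?! /? [a-zA-Z]+ /? > ) )+ }x;

# Return the raw text at the very start of $input, or undef if the
# input starts with a tag (or is empty).
sub leading_rawtext {
    my ($input) = @_;
    return $input =~ m{\A($rawtext)} ? $1 : undef;
}

print leading_rawtext("if a < b then</foo>"), "\n";
```

If the pattern holds up, the same expression without the /x whitespace could presumably be dropped into the rawtext rule in place of m{[^<]+}, but that step is an assumption.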

Speed on Large Files Depresses Me

I don't know if there is anything that can be done about the second problem. Most ECD files are fairly small, somewhere between 2KB and 8KB in size. These pose no problems, but occasionally there is an ECD file between 40KB and 80KB, and these take too long to parse.

For instance, parsing busybox.ecd which is around 60KB takes this long on my 600MHz VAIO:

    beppu@thestruggle$ time checkecd busybox.ecd

    real    0m31.873s
    user    0m30.280s
    sys     0m0.190s

There is also the mother of all ECD files (linux.ecd) which is a little over 1MB in size. I pity Parse::RecDescent when I hand it the contents of linux.ecd in a scalar. For shits and giggles, I let it run on my other box at home which is a 500MHz PII. It sat there for 13 hours, bringing the CPU to full utilization, and using up between 60 and 80 MB of RAM before it died when it got caught by the previously mentioned '<'-in-rawtext bug.

What's even worse, the CTO here at lineo wrote a parser in Python a while back, and it's a lot faster. It takes a line-by-line approach, and doesn't keep a lot of data in memory while it parses. It's not like his code is very beautiful or his library has a nice API, but I can't deny that my Perl module is a lot slower than his Python library.
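To make the line-by-line idea concrete, here is a minimal Perl sketch. It is not the CTO's code, and the syntax it assumes (one `<NAME ...>` or `</NAME>` alone on a line, Apache style, with every other non-blank line treated as body text) is a simplified guess, not the real ECD format. The function `scan_lines` and its callback arguments are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical line-oriented scanner. It reads one line at a time and
# keeps only a stack of open block names in memory, reporting what it
# sees through a callback: $on_event->($type, $value).
sub scan_lines {
    my ($fh, $on_event) = @_;
    my @stack;    # names of the blocks that are currently open

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ m{^\s*</([A-Za-z_]+)>\s*\z}) {
            # Closing tag: it must match the innermost open block.
            my $name = $1;
            my $open = pop @stack;
            die sprintf("line %d: unexpected </%s>\n", $., $name)
                unless defined $open && $open eq $name;
            $on_event->('end', $name);
        }
        elsif ($line =~ m{^\s*<([A-Za-z_]+)(?:\s+[^>]*)?>\s*\z}) {
            # Opening tag, optionally followed by arguments.
            push @stack, $1;
            $on_event->('start', $1);
        }
        elsif ($line =~ /\S/) {
            # Anything else is body text, even if it contains '<'.
            $on_event->('text', $line);
        }
    }
    die "unclosed <$stack[-1]> at end of input\n" if @stack;
    return;
}

# Usage: scan an in-memory sample and print each event.
my $sample = "<GROUP foo>\nx < y\n</GROUP>\n";
open my $in, '<', \$sample or die "cannot open sample: $!";
scan_lines($in, sub { print join(':', @_), "\n" });
```

Because a line is only a tag when the whole line has tag shape, a '<' in the middle of body text never reaches the tag branches, which is one reason this style sidesteps the rawtext problem under the stated assumption.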

What should I do?

use real XML

My current plan is to side-step the issue and convert the ECD files to XML, because XML::Parser is really fast. The XML version of busybox.ecd gets parsed in half a second instead of 30+ seconds. The hard part will be convincing everybody else to make the switch.

find another parsing module

Have any of you tried any of the other parsing modules available on CPAN? What have your experiences been?

anything else?

I'm open to suggestions.

Thanks for your time. If anyone can figure out how to allow '<' in a rawtext section, that'd be really cool.

AUTHOR

John Beppu <beppu@lineo.com>
