
I am trying to write a parser to handle human-generated "info files" that accompany the type of legal live concert recordings you can find at http://bt.etree.org (see for example http://bklyn.org/~cae/info-files/mmw2002-04-20.txt and numerous other examples in http://bklyn.org/~cae/info-files/)

These generally follow a common structure, but since they are typed up by hand there can be a lot of variation. The overall structure is usually something along the lines of band name, date, venue, source and transfer information, and then setlist/tracking info.

Because of the irregular structure, I am finding writing a pure token-based parser is pretty tricky. I have a halfway-decent line-oriented parser that I've implemented mostly as a bunch of "if" statements which test against some state variables and regular expressions which match certain tell-tale strings (for example different brands of microphones, DAT decks, concert hall names, state abbreviations, etc). For some masochistic reason though, I've decided that I need to reimplement this using a proper grammar and Parse::RecDescent seems like a good fit. But maybe not.
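To give a rough idea of what I mean by the line-oriented approach, here is a stripped-down sketch. It is not my actual code: the sub name `parse_info`, the three regexes, and the state names are made up for illustration, and the real tell-tale lists are far longer.

```perl
use strict;
use warnings;

# Hypothetical tell-tale patterns (illustrative only).
my $MIC_RE  = qr/\b(?:Schoeps|Neumann|AKG)\b/i;
my $DATE_RE = qr/\b(\d{4}-\d{2}-\d{2}|\d{1,2}\/\d{1,2}\/\d{2,4})\b/;
my $DISC_RE = qr/^dis[ck]\s*(\d+)/i;

sub parse_info {
    my ($text) = @_;
    my %info  = (source => [], discs => {});
    my $state = 'header';              # header -> source -> tracks
    my $disc;

    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;     # ignore blank lines
        $line =~ s/^\s+|\s+$//g;       # trim both ends

        if ($line =~ $DISC_RE) {       # "Disc 1" starts a track list
            $disc  = $1;
            $state = 'tracks';
        }
        elsif ($line =~ $MIC_RE) {     # a mic brand means source info
            $state = 'source';
            push @{ $info{source} }, $line;
        }
        elsif ($state eq 'header' && !defined $info{date} && $line =~ $DATE_RE) {
            $info{date} = $1;
        }
        elsif ($state eq 'header' && !defined $info{artist}) {
            $info{artist} = $line;     # first non-blank line
        }
        elsif ($state eq 'header') {
            $info{venue} = defined $info{venue} ? "$info{venue}, $line" : $line;
        }
        elsif ($state eq 'source') {   # source info runs until a disc line
            push @{ $info{source} }, $line;
        }
        elsif ($state eq 'tracks') {
            push @{ $info{discs}{$disc} }, $line;
        }
    }
    return \%info;
}
```

Every new special case means another `elsif` and often another state variable, which is what pushed me toward wanting a grammar.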

As I said, I'm having difficulty with the token-based nature of P::RD. In some cases I want things split up word-wise, but in others I'd prefer to look for strings anywhere within a line (e.g. microphone names like "Schoeps" are a pretty good indicator that I'm dealing with source info and that is pretty much guaranteed to span an entire line).

Here's my line-based parser. Here's the skeletal Parse::RecDescent parser I'm trying to use to do the same thing.

I've tried my hand at using the <skip> directive with a little luck (see the "artist" rule which seems to work well), and also some spectacular failures: if I try to use it in the source or sourceinfo rules, things end up not matching.

I'm also having difficulty with some of my rules being too greedy, and I'm not sure how to stop them. For example, the "source" rule as written often ends up gobbling tokens like "Disc 1", which I'm hoping to match with the "disc" rule, or "Set I", which I try to match with the "set" rule. I've tried using ...!rule (negative lookahead) a bit, but again with little luck. I'd like to have some way to tell the parser that a newline should (usually) signal the end of a rule.
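For reference, this is the direction I think the grammar needs to go, as a minimal sketch rather than my real grammar. It assumes that setting `$Parse::RecDescent::skip` to horizontal whitespace only, before the parser is built, is enough to keep newlines significant. The rule names (`section`, `srchead`, `srcmore`, `eol`) and the three microphone brands are placeholders, and I have not run it against real info files.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Assumption: skip only spaces and tabs between tokens, so "\n" is never
# silently consumed and can be used to end a rule.
$Parse::RecDescent::skip = '[ \t]*';

my $grammar = q{
    infofile : section(s) /\z/
               { $return = $item[1] }

    # Alternatives are tried in order: specific rules before catch-alls.
    section  : disc | set | source | other | blank

    disc     : /dis[ck][ \t]*\d+:?/i eol
               { $return = [ disc => $item{__PATTERN1__} ] }
    set      : /set[ \t]+[IVX\d]+:?/i eol
               { $return = [ set => $item{__PATTERN1__} ] }

    # A source block starts at a line naming a microphone brand anywhere
    # in it, and continues until a blank line or the start of a disc or set.
    source   : srchead srcmore(s?)
               { $return = [ source => [ $item[1], @{ $item[2] } ] ] }
    srchead  : /[^\n]*\b(?:Schoeps|Neumann|AKG)\b[^\n]*/i eol
               { $return = $item{__PATTERN1__} }
    srcmore  : ...!disc ...!set /[^\n]+/ eol
               { $return = $item{__PATTERN1__} }

    other    : /[^\n]+/ eol
               { $return = [ other => $item{__PATTERN1__} ] }
    blank    : /\n/
               { $return = [ blank => '' ] }
    eol      : /\n|\z/
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";

# Made-up sample input, not taken from a real info file.
my $text = "Source: Schoeps mk4 > DAT\nTransfer: DAT > CD\n\nDisc 1\n01. Intro\n";
my $tree = $parser->infofile($text);
die "Parse failed\n" unless defined $tree;
```

The idea is that `srcmore` refuses any line that `disc` or `set` could match, so the source block should stop there instead of swallowing it, and a blank line ends the block because `/[^\n]+/` cannot match an empty line.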

If anyone has any advice, I'd greatly appreciate it. It may be the case that the data set I'm working with is just NOT suited to this type of parsing, but I don't think I know enough about the solution domain to reach this decision myself.

In reply to Need Parse::RecDescent by Anonymous Monk
