Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a parser to handle human-generated "info files" that accompany the type of legal live concert recordings you can find at http://bt.etree.org (see for example http://bklyn.org/~cae/info-files/mmw2002-04-20.txt and numerous other examples in http://bklyn.org/~cae/info-files/).

These generally follow a common structure, but since they are typed up by hand there can be a lot of variation. The overall structure is usually something along the lines of band name, date, venue, source and transfer information, and then setlist/tracking info.

Because of the irregular structure, I am finding that writing a pure token-based parser is pretty tricky. I have a halfway-decent line-oriented parser that I've implemented mostly as a bunch of "if" statements which test against some state variables and regular expressions which match certain tell-tale strings (for example, different brands of microphones, DAT decks, concert hall names, state abbreviations, etc). For some masochistic reason though, I've decided that I need to reimplement this using a proper grammar, and Parse::RecDescent seems like a good fit. But maybe not.

As I said, I'm having difficulty with the token-based nature of P::RD. In some cases I want things split up word-wise, but in others I'd prefer to look for strings anywhere within a line (e.g. microphone names like "Schoeps" are a pretty good indicator that I'm dealing with source info and that is pretty much guaranteed to span an entire line).

Here's my line-based parser

Here's the skeletal Parse::RecDescent parser I'm trying to use to do the same thing

I've tried my hand at using the <skip> directive with a little luck (see the "artist" rule which seems to work well), and also some spectacular failures: if I try to use it in the source or sourceinfo rules, things end up not matching.

I'm also having difficulty with some of my rules being too greedy and am not sure how to stop them. For example, the "source" rule as written often ends up gobbling tokens like "Disc 1", which I'm hoping to match with the "disc" rule, or "Set I", which I try to match with the "set" rule. I've tried using ...!rule a bit, but again with little luck. I'd like to have some way to tell the parser that a newline should (usually) signal the end of a rule.

If anyone has any advice, I'd greatly appreciate it. It may be the case that the data set I'm working with is just NOT suited to this type of parsing, but I don't think I know enough about the solution domain to reach this decision myself.

Replies are listed 'Best First'.
Re: Need Parse::RecDescent Help
by kvale (Monsignor) on Mar 09, 2004 at 17:26 UTC
    As you have found, parsing natural language is not an easy task! People have tried to create grammars for natural language; they start out simple, but all the combinatorial possibilities and special cases result in a mess of rules that are difficult to make sense of.

    It seems to me that without a consistent structure, a hierarchical grammar will not really help you. But you can loosen the rules of a grammar to accept more possibilities. For instance,

    showinfo : artist date location | artist punct date location | date artist location | artist location date
    has only some possible combinations, but
    showinfo : showpart(s)
    showpart : artist | date | punct | location
    will allow the showinfo components to be in any order.

    -Mark
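
The loosened grammar above can be tried out with a small Parse::RecDescent sketch. Only the showinfo and showpart rules come from the post; the terminal rules (the artist names, the date format, the "City, ST" location format) and the parse_showinfo helper are hypothetical stand-ins chosen so the example is self-contained.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# showinfo/showpart follow the post above; every terminal rule below is a
# hypothetical stand-in (artist list, date format, "City, ST" locations).
my $grammar = <<'GRAMMAR';
showinfo : showpart(s) /\z/   { $return = $item[1] }
showpart : date | location | artist | punct
date     : /\d\d\d\d-\d\d-\d\d/                         { $return = [ date     => $item[1] ] }
location : /[A-Z][a-z]+(?: [A-Z][a-z]+)*, [A-Z][A-Z]\b/ { $return = [ location => $item[1] ] }
artist   : /Medeski Martin & Wood|Phish/                { $return = [ artist   => $item[1] ] }
punct    : /[-,:]/                                      { $return = [ punct    => $item[1] ] }
GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";

# Parse one header string and return [type, text] pairs in input order,
# with the punctuation parts dropped. Returns nothing if the parse fails.
sub parse_showinfo {
    my ($text) = @_;
    my $parts = $parser->showinfo($text) or return;
    return [ grep { $_->[0] ne 'punct' } @$parts ];
}
```

Because showpart(s) accepts the parts in any order, the same grammar handles both "artist, date, location" and "date, location, artist" inputs; the caller is left to decide which orderings make sense.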

      That's not really the problem, and things like "punct" can even appear in the name of an artist (e.g. "Medeski Martin & Wood"). What I really need is help recognizing patterns like "the brand name of a (microphone|dat deck|pre-amp) appears *anywhere on the entire line*, and I haven't seen any source info yet" and having that be my "sourceinfo" rule. I guess I'm having difficulty making the token-based parser relax to match an entire line.
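
One possible way to make a rule relax to a whole line is to write the Parse::RecDescent regex terminal so that it spans the line itself. The sketch below is an assumption-laden illustration, not the poster's grammar: the brand list, the rule bodies and the classify_lines helper are made up, and the "haven't seen any source info yet" condition is not modelled. Every terminal uses [^\n]* so no match runs past a newline, and disc and set are tried before sourceinfo so that it cannot gobble them.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Illustrative grammar: the brand names and rule bodies are guesses.
# Each terminal matches from the current position to the end of the line.
my $grammar = <<'GRAMMAR';
infofile   : line(s) /\z/   { $return = $item[1] }
line       : disc | set | sourceinfo | other
disc       : /Disc[ \t]+\d+[^\n]*/i                     { $return = [ disc       => $item[1] ] }
set        : /Set[ \t]+[IVX\d]+\b[^\n]*/i               { $return = [ set        => $item[1] ] }
sourceinfo : /[^\n]*\b(?:Schoeps|Neumann|AKG)\b[^\n]*/i { $return = [ sourceinfo => $item[1] ] }
other      : /[^\n]+/                                   { $return = [ other      => $item[1] ] }
GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";

# Return a list of [type, line] pairs, one per non-blank line of $text,
# or an empty list if the text cannot be parsed.
sub classify_lines {
    my ($text) = @_;
    return $parser->infofile($text) || [];
}
```

The default skip pattern still swallows blank lines and leading whitespace between terminals, which is why each terminal can start at the first visible character of a line. The order of alternatives in the line rule is the design choice that matters here: the most specific line types go first and the catch-all goes last.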
Re: Need Parse::RecDescent Help
by Bklyn (Initiate) on Mar 09, 2004 at 17:04 UTC
    I guess I was not properly logged in when I posted this question, but I am its owner. I also forgot to properly tag a couple of the links in there.

    You can find legal live concert recordings (the genesis of this problem) at the etree.org bittorrent site.

    Here is the sample info file I referred to: mmw2002-04-20.txt. Here is the index of a bunch of info files.

Re: Need Parse::RecDescent Help
by paulbort (Hermit) on Mar 09, 2004 at 22:02 UTC
    This might be one of the cases where it's faster to have the human do the heavy lifting. Have you considered changing the script to just display the file a line at a time, and prompt the user for some input about what the line is? The script could take a guess using your existing code, and use that guess as a default, but for a reasonable number of files (I'd say fewer than a thousand), it probably makes sense to make the human's job easier rather than try to replace the human entirely. (Like some OCR programs that will prompt for things they don't get, with a 'best guess' displayed for consideration.)

    --
    Spring: Forces, Coiled Again!
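
The guess-then-confirm loop suggested above could look roughly like this in plain Perl. It is only a sketch: the label names, the patterns and the guess_label and label_lines helpers are hypothetical, and a real script would presumably reuse the existing line-based parser for the guesses.

```perl
use strict;
use warnings;

# Hypothetical labels and tell-tale patterns, tried in order.
my @GUESSERS = (
    [ source => qr/\b(?:Schoeps|Neumann|AKG)\b/i ],
    [ disc   => qr/^\s*Disc\s+\d+/i ],
    [ set    => qr/^\s*Set\s+[IVX\d]+\b/i ],
    [ date   => qr/\b\d{4}-\d{2}-\d{2}\b/ ],
);

# Return the label of the first pattern that matches, or 'other'.
sub guess_label {
    my ($line) = @_;
    for my $guesser (@GUESSERS) {
        my ($label, $re) = @$guesser;
        return $label if $line =~ $re;
    }
    return 'other';
}

# Read lines from $in, show each with its guessed label on $out, and read
# the user's answer from $answers. An empty answer accepts the guess.
# Returns a reference to a list of [label, line] pairs.
sub label_lines {
    my ($in, $answers, $out) = @_;
    my @labelled;
    while (defined(my $line = <$in>)) {
        chomp $line;
        next unless $line =~ /\S/;          # skip blank lines
        my $guess = guess_label($line);
        print {$out} "$line\n  label [$guess]: ";
        my $answer = <$answers>;
        $answer = '' unless defined $answer;
        chomp $answer;
        my $label = length($answer) ? $answer : $guess;
        push @labelled, [ $label, $line ];
    }
    return \@labelled;
}
```

To use it interactively, open an info file as $fh and call label_lines($fh, \*STDIN, \*STDOUT); pressing Enter at a prompt keeps the guess, and typing a label overrides it.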