I'm working on a Perl script that takes apart nasty legalese, stores it in a database, and reassembles it on request based on parameters.

So far, I'm good on everything except the parsing. The text is straight ASCII in the form:

 (a)blahblahblahblah
 (1)blahblahblahblah
 (A)blahblahblahblah
 (i)blahblahblahblah
Where each multi-line section begins with a parenthesized indicator. The indicators run in the progression a-z, each with possible "children" of 1-???, each of those with possible children of A-Z, each of those with possible children numbered in roman numerals.

The parser needs to be able to identify each section, as well as understand its parentage (i.e. b.2.C.iii would have to know that it is not only iii, but also a "child" of b.2.C).
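To make the parentage requirement concrete, here is a minimal line-based sketch, not the parser described in this post. It keeps a stack of the markers currently "open" and assumes that every marker is either the next sibling at an open level or the first child of the deepest one. That assumption is what lets it tell the letter (i) that follows (h) from the roman numeral (i). The names `parse_sections`, `pick_level`, and `roman_to_int` are made up for illustration, and a text that starts mid-sequence or skips a marker would need a looser rule.

```perl
use strict;
use warnings;

# One entry per nesting level: a pattern for the marker, and a sub that
# turns a marker into its position in the sequence (1, 2, 3, ...).
my @levels = (
    { re => qr/^[a-z]$/,    ord => sub { ord($_[0]) - ord('a') + 1 } },
    { re => qr/^\d+$/,      ord => sub { $_[0] } },
    { re => qr/^[A-Z]$/,    ord => sub { ord($_[0]) - ord('A') + 1 } },
    { re => qr/^[ivxlc]+$/, ord => \&roman_to_int },
);

# Convert a lowercase roman numeral to an integer, reading right to left.
sub roman_to_int {
    my %val = (i => 1, v => 5, x => 10, l => 50, c => 100);
    my ($total, $prev) = (0, 0);
    for my $ch (reverse split //, $_[0]) {
        my $v = $val{$ch};
        $total += $v < $prev ? -$v : $v;
        $prev = $v;
    }
    return $total;
}

# Choose the deepest level at which $marker is either the next sibling of
# an open section or the first child of the deepest open section.
sub pick_level {
    my ($marker, $path) = @_;
    for my $level (reverse 0 .. $#levels) {
        next if $level > @$path;                     # cannot skip a generation
        next unless $marker =~ $levels[$level]{re};
        my $expected = $level < @$path
            ? $levels[$level]{ord}->($path->[$level]) + 1   # next sibling
            : 1;                                            # first child
        return $level if $levels[$level]{ord}->($marker) == $expected;
    }
    return;    # not a plausible marker; the caller treats the line as text
}

# Split the text into sections, each tagged with its full dotted path.
sub parse_sections {
    my ($text) = @_;
    my (@path, @sections);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*\(([A-Za-z0-9]+)\)\s*(.*)$/) {
            my ($marker, $body) = ($1, $2);
            my $level = pick_level($marker, \@path);
            if (defined $level) {
                splice @path, $level;    # close this level and anything deeper
                push @path, $marker;
                push @sections, { id => join('.', @path), text => $body };
                next;
            }
        }
        # Anything else continues the current multi-line section.
        $sections[-1]{text} .= "\n$line" if @sections;
    }
    return \@sections;
}

my $sample = join "\n",
    '(a)first section',
    'continued text',
    '(1)child of a',
    '(A)child of a.1',
    '(i)child of a.1.A',
    '(ii)second roman child',
    '(b)next top-level section';

print "$_->{id}\n" for @{ parse_sections($sample) };
```

Each section carries its whole path as `id`, so storing the parent in a database is a matter of dropping the last dotted component. One case stays genuinely ambiguous: an (i) that appears directly under an uppercase section while the open top-level section is (h). The sketch prefers the deeper reading there, which is a guess rather than a rule taken from the text.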

I wrote up a chunky little parser that does the deed, but I've run into complications:

As far as I can tell, the best way to deal with this is to use a real parser that evaluates the entire text, rather than treating each line as mostly distinct as I do now. Is this a task for Parse::RecDescent? The documentation really seems to assume experience with parsers. Does anyone have a good starting point? Has anyone done anything similar to this?

In reply to Parsing the Law by swiftone
