I'm working on a Perl script that takes apart nasty legalese, stores it in a database, and reassembles it on request based on parameters.
So far, I'm good on everything except the parsing. The text is straight ASCII in the form:
(a)blahblahblahblah
(1)blahblahblahblah
(A)blahblahblahblah
(i)blahblahblahblah
Each multi-line section is:
- Space-indented (but not in varying widths)
- Begun with an indicator in parentheses
The indicators progress through a-z, each with possible "children" of 1-???, each with possible children of A-Z, each with possible children of i, ii, iii, ... (roman numerals).
The parser needs to be able to identify each section, as well as understand its parentage (i.e. b.2.C.iii would have to know that it was not only iii, but also a "child" of b.2.C).
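To make the parentage requirement concrete, here is a minimal sketch of tracking the current path with one array slot per nesting level. The helper name `section_path` and the level numbering (0 = a-z, 1 = numbers, 2 = A-Z, 3 = roman numerals) are assumptions made for illustration; it also assumes the level of each indicator has already been worked out, which is the hard part.

```perl
use strict;
use warnings;

# Hypothetical helper: $path holds one label per nesting level
# (0 = a-z, 1 = numbers, 2 = A-Z, 3 = roman numerals).
sub section_path {
    my ($path, $level, $label) = @_;
    # Forget anything at this depth or deeper, then record the new label.
    splice @$path, $level if @$path > $level;
    $path->[$level] = $label;
    return join '.', @$path;
}

my @path;
for my $step ([0, 'b'], [1, '2'], [2, 'C'], [3, 'iii'], [1, '3']) {
    my ($level, $label) = @$step;
    print section_path(\@path, $level, $label), "\n";
}
```

The last step shows the stack unwinding: after b.2.C.iii, a level-1 indicator "3" yields b.3 rather than something hanging off C.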
I wrote up a chunky little parser that does the deed, but I've run into complications:
- It appears that some text sections also have "lists", which are denoted by sections starting (N) where N is a decimal number. These lists shouldn't be pulled out, but the parser can't distinguish them from a subsection if they fall in the wrong spot.
- I currently "fudge" roman numeral i (to distinguish it from the letter "i"), and I'm worried that as soon as my parser hits the text, it will break.
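One way to attack both complications without a full grammar is to classify each indicator against what the current position allows: either the first child of the current section or the next sibling of it or one of its ancestors. The sketch below is only an illustration of that idea, not a tested solution for the real text. The names `roman_to_int`, `ordinal_for`, and `classify` are hypothetical, the path is assumed to be stored as ordinals (so b.2.C is `[2, 2, 3]`), and the roman converter does not reject malformed numerals.

```perl
use strict;
use warnings;

# Convert a lowercase roman numeral to an integer; undef if it is not one.
sub roman_to_int {
    my ($s) = @_;
    return unless $s =~ /^[ivxlc]+$/;
    my %val = (i => 1, v => 5, x => 10, l => 50, c => 100);
    my ($total, $prev) = (0, 0);
    for my $ch (reverse split //, $s) {
        my $v = $val{$ch};
        $total += $v < $prev ? -$v : $v;   # a smaller digit on the left subtracts
        $prev = $v;
    }
    return $total;
}

# Ordinal of $marker if it is a valid label at $level, else undef.
# Levels: 0 = a-z, 1 = decimal, 2 = A-Z, 3 = lowercase roman.
sub ordinal_for {
    my ($level, $marker) = @_;
    return ord($marker) - ord('a') + 1 if $level == 0 && $marker =~ /^[a-z]$/;
    return $marker + 0                 if $level == 1 && $marker =~ /^[0-9]+$/;
    return ord($marker) - ord('A') + 1 if $level == 2 && $marker =~ /^[A-Z]$/;
    return roman_to_int($marker)       if $level == 3;
    return;
}

# Decide which level $marker belongs to, given the ordinals of the current
# path. Returns the level, or undef when the marker fits nowhere and is
# therefore more likely an inline list item than a section.
sub classify {
    my ($path, $marker) = @_;
    my $depth = @$path;

    # First choice: the first child of the current section.
    if ($depth <= 3) {
        my $n = ordinal_for($depth, $marker);
        return $depth if defined $n && $n == 1;
    }
    # Otherwise: the next sibling of the current section or of an ancestor,
    # checking the deepest level first.
    for my $level (reverse 0 .. $depth - 1) {
        my $n = ordinal_for($level, $marker);
        return $level if defined $n && $n == $path->[$level] + 1;
    }
    return;
}

for my $case ([[8], 'i'], [[2, 2, 3], 'i'], [[2, 2, 3], '1'], [[2, 2, 3, 2], 'iii']) {
    my ($path, $marker) = @$case;
    my $level = classify($path, $marker);
    printf "(%s) after [%s]: %s\n", $marker, join(',', @$path),
        defined $level ? "level $level" : "not a section";
}
```

With this approach, "(i)" after section h is read as the letter i, while "(i)" directly under b.2.C is read as roman numeral one, and a "(1)" under b.2.C is rejected as a section. It remains a heuristic: a list whose numbers happen to match the expected first child or next sibling would still be misread, and "(i)" under h.1.A is genuinely ambiguous (the sketch picks the roman reading).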
As far as I can tell, the best way to deal with this is to use a real parser that will evaluate the entire text, rather than considering each line as mostly distinct as I do now. Is this a task for Parse::RecDescent? The documentation really seems to assume experience with parsers. Does anyone have a good starting point? Has anyone done anything similar to this?