AndyH has asked for the wisdom of the Perl Monks concerning the following question:

There's usually lots of response to regexp questions. Hope this one is no exception ...

I need to parse text files that use "business" type indented paragraph numbering, with numerals, lower case roman numerals and lower case letters, e.g. 1.ii.a.

The catch is that the paragraphs in the (many) documents I need to parse do not have the paragraph number in full, they are like this:

1. First paragraph text
   i. first sub para text
   ii. second sub para text
      a. first sub-sub para text.
   iii. third sub para text.
2. Second paragraph text
etc...

I need to parse this file, identify the numbers, recreate the full number and use it as the key to store the text of the paragraph in a hash, e.g. "1.ii.a" would be the hash key for "first sub-sub para text".

I don't think this can be done without some sort of state machine or counter arrangement. My efforts with regexps alone fail when trying to tell the difference between sub para "i." (i.e. the one before sub para "ii.") and sub-sub para "i." (i.e. the one between sub-sub para "h." and sub-sub para "j.").

Indentation cannot be relied on to determine the nesting level, the full stops after the numbers are sometimes omitted, and not all paras have a number/letter. A para without a number is to be treated as part of the last paragraph that had a number/letter.

All pointers to code/modules and hints gratefully received.

Thanks, AndyH

Re: Matching indented paragraph numbering with regexps
by Hofmator (Curate) on Apr 01, 2004 at 12:57 UTC
    Watch Text::Autoformat at work:
    use Text::Autoformat;

    my $text = <<EOT;
    1. First paragraph text
       i. first sub para text
       iii. second sub para text
          a. first sub-sub para text.
          a. second sub-sub para text.
       iii. third sub para text.
    2. Second paragraph text
    EOT

    print autoformat $text, {all=>1};

    __END__
    1. First paragraph text
       i. first sub para text
       ii. second sub para text
          a. first sub-sub para text.
          b. second sub-sub para text.
       iii. third sub para text.
    2. Second paragraph text

    So it is possible, but considering that Text::Autoformat is a TheDamian module, the reading will not be easy on the eye/brain ;-))

    -- Hofmator

Re: Matching indented paragraph numbering with regexps
by Abigail-II (Bishop) on Apr 01, 2004 at 12:58 UTC
    Given the lack of constraints on the text to parse, this might be impossible to get right all the time. You already pointed out the difficulty (or rather, impossibility) of determining whether i is a Roman numeral or a Latin letter. To complicate things further, both i and a are ordinary English words. vi is the name of an editor, and xi is an uncommon, but not impossible, English word. And so is li.

    Abigail

      You could require the numerals to be the first non-whitespace on a line and to end with a full stop.

      Even so, it can still fail, for example if the fifth sub-point ends with the word "vi." on a line by itself:

      iv. ...emacs ....
      v. ......... ends with the word
      vi.
      vi. ....
      vii ....

      With these restrictions, though, it is much harder to fail. Additionally, you could try to step back if you find a duplicated numeral. But even that does not help if point v. in the example above was the last one.

      I think the "right" solution depends very much on your data and on the demands on your program.
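
      For what it's worth, the counter idea from the original question combined with the restriction above might be sketched roughly as follows. This is only a sketch under assumptions, not a tested solution for the real documents: it assumes three fixed levels (numerals, lower case roman numerals, lower case letters), it requires the label to start the line and end with a full stop, and parse_outline, to_roman and @FORMAT are names made up for this example. A label is only accepted if it is the next expected label at the current level, the first label one level down, or the next label at a shallower level; anything else is appended to the last numbered paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper: integer -> lower case roman numeral.
sub to_roman {
    my ($n) = @_;
    my @map = ([1000,'m'],[900,'cm'],[500,'d'],[400,'cd'],[100,'c'],[90,'xc'],
               [50,'l'],[40,'xl'],[10,'x'],[9,'ix'],[5,'v'],[4,'iv'],[1,'i']);
    my $r = '';
    for my $p (@map) {
        while ($n >= $p->[0]) { $r .= $p->[1]; $n -= $p->[0] }
    }
    return $r;
}

# Assumed label style per nesting level: 1, 2, ... / i, ii, ... / a, b, ...
my @FORMAT = (
    sub { $_[0] },
    \&to_roman,
    sub { $_[0] <= 26 ? chr(ord('a') + $_[0] - 1) : '' },
);

# Hypothetical parser: returns a hash ref of full number => paragraph text.
sub parse_outline {
    my ($text) = @_;
    my (@count, %para, $key);    # @count holds the current ordinal per level

    LINE: for my $line (split /\n/, $text) {
        # Assumption: the label is the first non-whitespace and ends with "."
        if ($line =~ /^\s*([0-9]+|[a-z]+)\.\s*(.*)$/) {
            my ($label, $rest) = ($1, $2);
            my $cur = $#count;
            # Try the current level first, then one level deeper,
            # then the shallower levels from deepest to shallowest.
            my @try = grep { $_ >= 0 && $_ <= $#FORMAT }
                      ($cur, $cur + 1, reverse 0 .. $cur - 1);
            for my $d (@try) {
                my $next = ($d <= $cur ? $count[$d] : 0) + 1;
                next unless $FORMAT[$d]->($next) eq $label;
                $#count    = $d;      # forget counters of deeper levels
                $count[$d] = $next;
                $key = join '.', map { $FORMAT[$_]->($count[$_]) } 0 .. $d;
                $para{$key} = $rest;
                next LINE;
            }
        }
        # No expected label matched: continuation of the last numbered para.
        next unless defined $key;
        (my $t = $line) =~ s/^\s+|\s+$//g;
        $para{$key} .= ($para{$key} eq '' ? '' : ' ') . $t if length $t;
    }
    return \%para;
}

# Usage with the sample from the question.
my $paras = parse_outline(<<'EOT');
1. First paragraph text
   i. first sub para text
   ii. second sub para text
      a. first sub-sub para text.
   iii. third sub para text.
2. Second paragraph text
EOT
print "$_ => $paras->{$_}\n" for sort keys %$paras;
```

      Because the current level is tried first, an "i." that follows "h." is taken as a letter, while an "i." directly under a numbered paragraph is taken as a roman numeral. It still goes wrong in the stray "vi." case shown above, since that line looks exactly like the expected next label, and the step back idea is not implemented here.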