Regexp questions usually get lots of responses. Hope this one is no exception ...
I need to parse text files that use "business" type indented paragraph numbering, with numerals, lower case roman numerals and lower case letters, e.g. 1.ii.a.
The catch is that the paragraphs in the (many) documents I need to parse do not carry the paragraph number in full; they look like this:
1. First paragraph text
i. first sub para text
ii. second sub para text
a. first sub-sub para text.
iii. third sub para text.
2. Second paragraph text
etc...
I need to parse this file, identify the numbers, recreate the full number and use it as the key to store the text of the paragraph in a hash, e.g. "1.ii.a" would be the hash key for "first sub-sub para text".
I don't think this can be done without some sort of state machine or counter arrangement. My efforts with regexps alone fail when trying to tell the difference between sub para "i." (the one before sub para "ii.") and sub-sub para "i." (the one between sub-sub para "h." and sub-sub para "j.").
Indenting cannot be relied on to determine the nesting level, the full stops after the numbers are sometimes omitted, and not all the paras have a number/letter. A para without a number is to be treated as part of the last paragraph that had a number/letter.
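A minimal sketch of such a counter arrangement, written in Python purely for illustration (the same logic should port to Perl with a hash and an array of counters). All names here (`MARKER`, `to_roman`, `label`, `next_path`, `parse`) are hypothetical. It assumes one paragraph per line, at most three levels, numbering that starts at 1 / i / a with no gaps, and it settles genuinely ambiguous markers (e.g. "v" after "u.") in favour of the deepest level. A continuation line that happens to start with the expected marker (say the word "a" where "a." is due) would still be misread.

```python
import re

# Optional indent, a candidate marker (digits or lower-case letters),
# an optional full stop, then the paragraph text.
MARKER = re.compile(r"^\s*(\d+|[a-z]+)\.?(?:\s+(.*))?$")

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]

def to_roman(n):
    """Lower-case roman numeral for a positive integer."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def label(level, n):
    """Marker text for counter n at level 0 (1,2,..), 1 (i,ii,..) or 2 (a,b,..)."""
    if level == 0:
        return str(n)
    if level == 1:
        return to_roman(n)
    return chr(ord("a") + n - 1)  # assumes no more than 26 sub-sub paras

def next_path(path, token):
    """Return the new counter path if token is a marker we expect, else None."""
    candidates = []
    if path:
        candidates.append(path[:-1] + [path[-1] + 1])   # next sibling
    if len(path) < 3:
        candidates.append(path + [1])                    # first child
    for depth in range(len(path) - 1, 0, -1):            # ancestors, deepest first
        candidates.append(path[:depth - 1] + [path[depth - 1] + 1])
    for cand in candidates:
        if label(len(cand) - 1, cand[-1]) == token:
            return cand
    return None

def parse(lines):
    """Map full paragraph numbers such as '1.ii.a' to paragraph text."""
    paras = {}
    path = []    # counters of the current paragraph, e.g. [1, 2, 1] for 1.ii.a
    key = None
    for line in lines:
        m = MARKER.match(line)
        new_path = next_path(path, m.group(1)) if m else None
        if new_path is None:
            # Not an expected marker: treat as part of the last numbered para.
            # Text before the first numbered para is dropped.
            if key is not None and line.strip():
                paras[key] = (paras[key] + " " + line.strip()).strip()
            continue
        path = new_path
        key = ".".join(label(i, n) for i, n in enumerate(path))
        paras[key] = (m.group(2) or "").strip()
    return paras

sample = [
    "1. First paragraph text",
    "i. first sub para text",
    "ii. second sub para text",
    "a. first sub-sub para text.",
    "iii. third sub para text.",
    "2. Second paragraph text",
]
for number, text in parse(sample).items():
    print(number, "=>", text)
```

The point of the design is that a marker is only accepted if it is one of the values the current position allows (next sibling, first child, or an ancestor's next sibling), which is what separates sub para "i." from the sub-sub para "i." that follows "h.".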
All pointers to code/modules and hints gratefully received.
Thanks, AndyH
In reply to Matching indented paragraph numbering with regexps by AndyH