Regexp questions usually get plenty of responses. I hope this one is no exception ...

I need to parse text files that use "business"-style indented paragraph numbering, with numerals, lower-case roman numerals and lower-case letters, e.g. 1.ii.a.

The catch is that the paragraphs in the (many) documents I need to parse do not give the paragraph number in full; they look like this:

1. First paragraph text
  i. first sub para text
  ii. second sub para text
    a. first sub-sub para text.
  iii. third sub para text.
2. Second paragraph text
etc...

I need to parse this file, identify the numbers, recreate the full number and use it as the key to store the text of the paragraph in a hash, e.g. "1.ii.a" would be the hash key for "first sub-sub para text".

I don't think this can be done without some sort of state machine or counter arrangement. My attempts with regexps alone fail when trying to tell the difference between sub para "i." (the one before sub para "ii.") and sub-sub para "i." (the one between sub-sub para "h." and sub-sub para "j.").

Indentation cannot be relied on to determine the nesting level, the full stops after the numbers are sometimes omitted, and not every paragraph has a number or letter. A paragraph without one is to be treated as part of the last paragraph that did have a number or letter.
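
To make the counter idea concrete, here is a rough sketch of the kind of arrangement described above. It is written in Python purely for illustration (the logic should port directly to Perl with a hash and three counters), and every name in it (`LABEL`, `to_roman`, `to_letter`, `parse_paragraphs`) is hypothetical. It assumes exactly three levels (numeral, roman, letter), that labels appear in sequence without gaps, and that a leading token only counts as a label if it is the next expected label at some level. When a token would fit two levels (e.g. "v" or "x"), the sketch arbitrarily prefers the deeper one.

```python
import re

# Optional leading whitespace, a numeric or lower-case token,
# an optional full stop, then whitespace and the paragraph text.
LABEL = re.compile(r"^\s*(\d+|[a-z]+)\.?\s+(.*)$")

ROMAN_PAIRS = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
               (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
               (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]


def to_roman(n):
    """Lower-case roman numeral for n >= 1 (empty string for 0)."""
    out = []
    for value, symbol in ROMAN_PAIRS:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letter(n):
    """Letter label for 1..26, or None outside that range."""
    return chr(ord("a") + n - 1) if 1 <= n <= 26 else None


def parse_paragraphs(lines):
    """Return a dict mapping full numbers such as '1.ii.a' to paragraph text."""
    counters = [0, 0, 0]          # numeral, roman, letter
    paras = {}
    key = None                    # key of the last numbered paragraph
    for line in lines:
        text = line.strip()
        if not text:
            continue
        level = None
        m = LABEL.match(line)
        if m:
            token = m.group(1)
            # The only labels accepted are the next ones in sequence.
            expected = [
                str(counters[0] + 1),
                to_roman(counters[1] + 1) if counters[0] else None,
                to_letter(counters[2] + 1) if counters[1] else None,
            ]
            for depth in (2, 1, 0):   # assumption: prefer the deepest match
                if token == expected[depth]:
                    level = depth
                    break
        if level is None:
            # No recognised label: treat as part of the last numbered paragraph.
            if key is not None:
                paras[key] += " " + text
            continue
        counters[level] += 1
        for deeper in range(level + 1, 3):
            counters[deeper] = 0      # a new parent resets its children
        labels = [str(counters[0]), to_roman(counters[1]), to_letter(counters[2])]
        key = ".".join(labels[:level + 1])
        paras[key] = m.group(2).strip()
    return paras
```

Because indentation is ignored and only the expected next label is accepted, "i" after "h" lands at the letter level, while "i" straight after a numeral lands at the roman level. One known weakness: when the full stop is omitted, an unnumbered line that happens to begin with the expected label as a word (e.g. "a further point" when "a" is expected) would be misread as a new paragraph.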

All pointers to code/modules and hints gratefully received.

Thanks, AndyH

In reply to Matching indented paragraph numbering with regexps by AndyH
