Given the lack of constraints on the text to parse, this might
be impossible to get right all the time. You already pointed
out the difficulty (or rather, impossibility) of determining
whether i is a Roman numeral or a Latin letter.
To complicate things further, both i and a
are ordinary English words. vi is the name of an
editor, and xi is an uncommon, but not impossible,
English word. And so is li.
Abigail
Comment on Re: Matching indented paragraph numbering with regexps
You could require the numeral to be the first non-whitespace text on a line and to end with a period.
Nevertheless, this can still fail, for example if the fifth sub-point ends with the word "vi." on a line of its own:
iv. ...emacs ....
v. ......... ends with the word
vi.
vi. ....
vii ....
With these restrictions, though, it's much harder to fail. Additionally, you could try to step back if you find a duplicate numeral. But even that does not help if point v. in the example above was the last one.
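To make the two ideas concrete, here is a rough sketch in Python (not the Perl of the original question). All names in it (LABEL, roman_to_int, find_labels) are made up for illustration. It assumes lowercase numerals, assumes the first candidate it sees really is a label, and uses one possible tie-break for the step back: when the same numeral shows up twice, a candidate followed by text on its line wins over a bare one. That tie-break is a guess, not something your data guarantees.

```python
import re

# Candidate label: first non-whitespace text on the line, lowercase Roman
# letters only, terminated by a period and then whitespace or end of line.
LABEL = re.compile(r"^\s*([ivxlcdm]+)\.(?:\s|$)")

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral to an int (does not reject malformed numerals)."""
    values = [ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for pos, value in enumerate(values):
        # A smaller value in front of a larger one is subtracted (iv = 4).
        if pos + 1 < len(values) and value < values[pos + 1]:
            total -= value
        else:
            total += value
    return total


def find_labels(lines):
    """Return the indices of the lines accepted as numbered sub-points."""
    accepted = []  # (line index, numeral value, label followed by text?)
    for idx, line in enumerate(lines):
        match = LABEL.match(line)
        if not match:
            continue
        value = roman_to_int(match.group(1))
        has_text = bool(line[match.end():].strip())
        if not accepted or value == accepted[-1][1] + 1:
            # Assumption: the first candidate is genuine; later ones must
            # continue the sequence.
            accepted.append((idx, value, has_text))
        elif value == accepted[-1][1] and has_text and not accepted[-1][2]:
            # Duplicate numeral: step back and replace the earlier bare
            # candidate, which was probably a word such as "vi.".
            accepted[-1] = (idx, value, has_text)
    return [idx for idx, _, _ in accepted]


lines = [
    "iv. ...emacs ....",
    "v. ......... ends with the word",
    "vi.",
    "vi. ....",
    "vii ....",
]
print(find_labels(lines))
```

On the example above this accepts the lines at indices 0, 1 and 3 and skips the bare "vi." at index 2. If the input stops after that bare "vi." line, nothing contradicts it, so it is still wrongly accepted as a label.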
I think the "right" solution depends very much on your data and on the requirements of your program.