in reply to Regex bafflement
I usually clean that kind of markup out with HTML::Tidy first. See also the tidy options --bare and --clean. Once you have sane HTML, further processing gets much easier.
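For what it's worth, here is a minimal sketch of what I mean, assuming HTML::Tidy is installed from CPAN and that the `bare` and `clean` keys passed to the constructor map onto tidy's --bare and --clean settings. The sample markup is made up for illustration; substitute your own input.

```perl
use strict;
use warnings;
use HTML::Tidy;

# Hypothetical sample of messy, Word-style markup (not from the original thread).
my $html = <<'HTML';
<html><head><title>sample</title></head>
<body><p class=MsoNormal><font color="red">Hello&nbsp;world</font></p></body></html>
HTML

# Options are handed to libtidy; underscores stand in for hyphens.
my $tidy = HTML::Tidy->new({
    bare      => 1,   # assumed equivalent of --bare
    clean     => 1,   # assumed equivalent of --clean
    tidy_mark => 0,   # leave out the "generator" meta tag
});

# clean() returns the tidied document as a single string.
my $clean = $tidy->clean($html);
print $clean;
```

From there you can run your regexes (or better, a real parser) against `$clean` instead of the raw input.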
Update: Word HTML to TWiki converter may also be of interest.
HTH,
planetscape