in reply to Regex bafflement

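I usually strip stuff like that out with HTML::Tidy. See also Tidy's --bare and --clean options: --bare strips Microsoft-specific HTML from Word 2000 documents, and --clean replaces presentational tags and attributes with style rules. Once you have sane HTML, further processing gets much easier.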
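Something along these lines is roughly what I mean. Treat it as a sketch, not gospel: the sample markup is made up, and I'm assuming your installed HTML::Tidy/libtidy accepts the bare and clean settings when passed through the constructor (they mirror the command-line options).

```perl
use strict;
use warnings;
use HTML::Tidy;

# Hypothetical Word-style fragment, invented for illustration only.
my $word_html = <<'HTML';
<html>
<head><title>Sample</title></head>
<body>
<p class=MsoNormal><font color="red"><b>Hello</b></font></p>
</body>
</html>
HTML

# Settings other than config_file are handed to libtidy as configuration;
# 'bare' and 'clean' are assumed to be supported by the installed version.
my $tidy = HTML::Tidy->new({
    bare      => 'yes',
    clean     => 'yes',
    tidy_mark => 'no',
});

# clean() returns the repaired document as a single string.
my $clean = $tidy->clean($word_html);
print $clean;

# Report whatever Tidy complained about along the way.
print STDERR $_->as_string, "\n" for $tidy->messages;
```

Run your regexes (or better, a real parser) against $clean rather than the raw export.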

Update: Word HTML to TWiki converter may also be of interest.

HTH,

planetscape