Re: Very specific HTML parsing question

Ordinarily I'd say this is a job for HTML::TokeParser. What you describe would be very simple to implement using it, and it would be fairly easy to fix if it breaks when MSWord's HTML output changes in some bizarre way in a future edition (extremely likely to happen). There's even a tutorial here about it, which should make it that much easier to figure out.

On the other hand, it seems somewhat churlish to tell you to bring in the whole HTML::Parser suite just for the last 1% of your code... but actually, I'm going to. The reason is this: I could try to write the regex you need (though it would take longer than writing it with TokeParser), but I would probably fail. Several people would then respond with corrections explaining what I had missed, and that I was stupid to try to use a regex instead of TokeParser. And they would be right. Somebody might actually supply a regex that would do what you want, but it would probably end up being fairly long and painful to read, and they'd probably finish by saying you shouldn't use it, you should use HTML::TokeParser.

The upside is that you may look at TokeParser and realize that it could vastly simplify your script to use it in some other places--this is, after all, what CPAN modules are best at. :-)
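
To give a feel for how little code TokeParser needs, here is a minimal sketch. I'm guessing at the details of your input, so the sample HTML, the "MsoNormal" class name and the paragraphs_with_class helper are all made up for illustration. Only new, get_tag and get_trimmed_text come from the module itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample standing in for Word's HTML output.
my $html = <<'HTML';
<html><body>
<p class="MsoNormal">First <b>paragraph</b></p>
<p class="Other">Skipped</p>
<P CLASS="MsoNormal">Second
paragraph</P>
</body></html>
HTML

# Hypothetical helper: return the text of every <p> whose class
# attribute equals $class.
sub paragraphs_with_class {
    my ($doc, $class) = @_;

    # A scalar reference is parsed as the document itself, not as a file name.
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot parse document";

    my @found;
    # get_tag skips ahead to the next <p> start tag. Tag and attribute
    # names come back lowercased, so <P CLASS=...> matches as well.
    while (my $tag = $p->get_tag('p')) {
        my $attr = $tag->[1];    # hash ref of the tag's attributes
        next unless defined $attr->{class} && $attr->{class} eq $class;

        # Collect the text up to the closing </p>, dropping inline tags
        # and collapsing runs of whitespace.
        push @found, $p->get_trimmed_text('/p');
    }
    return @found;
}

print "$_\n" for paragraphs_with_class($html, 'MsoNormal');
```

If Word changes its markup, the fix is usually confined to the test on the attributes or to the tag name passed to get_tag, which is the maintainability argument in a nutshell.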

You should also look at the base HTML::Parser module, just to see if the model it uses for parsing makes more sense to you--both systems have their advocates.
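
For comparison, here is roughly what the callback model of the base module looks like. It is only a sketch under assumptions: the sample HTML, the "MsoNormal" class and the paragraphs_via_callbacks helper are invented for illustration. Instead of asking for the next token, you register handlers and the parser calls them as it goes.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample standing in for Word's HTML output.
my $sample = <<'HTML';
<p class="MsoNormal">First <b>paragraph</b></p>
<p class="Other">Skipped</p>
<p class="MsoNormal">Second paragraph</p>
HTML

# Hypothetical helper: collect the text of every <p> whose class
# attribute equals $class, using event handlers.
sub paragraphs_via_callbacks {
    my ($doc, $class) = @_;
    my @found;
    my $inside = 0;    # true while we are inside a matching <p>

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag with its lowercased name and attributes.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'p';
            if (defined $attr->{class} && $attr->{class} eq $class) {
                $inside = 1;
                push @found, '';
            }
        }, 'tagname, attr' ],
        # Called for text; 'dtext' delivers it with entities decoded.
        # Text may arrive in several chunks, so append rather than assign.
        text_h => [ sub { $found[-1] .= $_[0] if $inside }, 'dtext' ],
        # Called for every end tag.
        end_h => [ sub { $inside = 0 if $_[0] eq 'p' }, 'tagname' ],
    );

    $parser->parse($doc);
    $parser->eof;    # signal end of document so buffered text is flushed
    return @found;
}

print "$_\n" for paragraphs_via_callbacks($sample, 'MsoNormal');
```

The state you would otherwise keep implicitly in a loop (here the $inside flag) has to live in variables shared between the handlers. Some people find that natural and some find it awkward, which is most of what the choice between the two modules comes down to.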

If God had meant us to fly, he would *never* have given us the railroads.
--Michael Flanders
