Re: huge multiline regex

There are too many variables involved in free-form markup such as HTML: whitespace can fall in arbitrary places, including within tags; tag attributes can change; and even the markup itself can change without altering the intent of the underlying text. Regular expressions are great for pattern matching, but what you're doing goes beyond pattern matching into markup parsing. Regular expressions might form one part of a full-fledged markup parser, but they're not usually a complete solution.

You really ought to be using something more robust than one huge regular expression. HTML::TokeParser and HTML::Parser are two possible alternatives, both of which can handle the intricate nuances of HTML. A regular expression that covers all the possibilities is difficult to construct correctly, and it tends to break as soon as the markup changes. An HTML parser is the more suitable tool for the job.
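
To give a rough idea of what the parser approach looks like, here is a minimal sketch using HTML::TokeParser to pull link targets and link text out of a document. I haven't seen your actual input, so the sample HTML, the example.com URLs, and the extract_links helper are all made up for illustration; swap in whatever tags you're really after.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): returns a list of
# [href, link text] pairs found in the HTML string passed in.
sub extract_links {
    my ($html) = @_;

    # Passing a scalar reference tells the parser to read from the string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my @links;
    # get_tag() skips ahead to the next <a> start tag, whatever its case.
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};    # attribute names are lowercased
        next unless defined $href;     # skip anchors without an href

        # Text up to the closing </a>, with runs of whitespace collapsed.
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

# Made-up sample input: the second link is split across lines, uses
# uppercase names, single quotes, and an extra attribute.
my $html = <<'HTML';
<p>See <a href="http://example.com/one">first link</a> and
<A
   class="x"
   HREF='http://example.com/two'
>second
   link</A>.</p>
HTML

for my $link (extract_links($html)) {
    print "$link->[0] => $link->[1]\n";
}
```

Both links come out the same way even though the second one is laid out completely differently, which is exactly the kind of variation a single regular expression has to anticipate case by case.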

Dave
