in reply to Search and replace everything except html tags
Okay, HTML::Parser has been mentioned several times in this thread, but nobody has taken the "easy" route with it. Most of the suggestions fall into traps and do not fully address your original question. As I understand it, you want to target specific text segments within a certain zone of an HTML document, ignore any HTML tags embedded in that zone, and stop extracting once you are past the zone.
HTML::Parser has recently evolved, and I invite you to check out the new syntax, but for the time being I will stick to the "old" syntax, which is still quite compatible with the current release.
Use the start() and end() callback methods to keep track of your context. Use the text() method to slurp text segments into a cache for as long as you are inside the zone. When end() sees the tag that closes the zone, it switches the collecting condition off and dumps the cache, since the zone scan is complete.
This keeps track of your context and, at the same time, neatly separates your text from any tags embedded in it. Regexps be damned -- they only need to be applied to the cached result, not to the HTML.
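A minimal sketch of that approach, using the "old" subclassing syntax, in which the opening-tag callback is named start(). The package name ZoneText, the helper extract_zones(), the choice of td as the zone tag, and the sample HTML are all assumptions made up for illustration; they are not from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subclass: collects the text found inside a target zone.
package ZoneText;
use base 'HTML::Parser';

my $ZONE_TAG = 'td';    # assumed zone tag, change to suit your document

# Opening tag: entering the zone resets the cache.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    return unless $tag eq $ZONE_TAG;
    $self->{depth}++;
    $self->{cache} = '' if $self->{depth} == 1;
}

# Text segment: cached only while inside the zone, so embedded
# tags such as <b> simply never reach the cache.
sub text {
    my ($self, $text) = @_;
    $self->{cache} .= $text if $self->{depth};
}

# Closing tag: leaving the outermost zone dumps the cache.
sub end {
    my ($self, $tag, $origtext) = @_;
    return unless $tag eq $ZONE_TAG && $self->{depth};
    push @{ $self->{zones} }, $self->{cache} if --$self->{depth} == 0;
}

package main;

# Hypothetical helper: returns one string per zone found in $html.
sub extract_zones {
    my ($html) = @_;
    my $p = ZoneText->new;    # no arguments: old-style method callbacks
    $p->{depth} = 0;
    $p->{zones} = [];
    $p->parse($html);
    $p->eof;
    return @{ $p->{zones} };
}

my @zones = extract_zones(
    '<table><tr><td>Hello <b>big</b> world</td></tr></table><p>outside</p>'
);

# The regexp touches only the cached text, never the HTML.
(my $edited = $zones[0]) =~ s/world/planet/;
print "$edited\n";    # prints "Hello big planet"
```

The depth counter is there only so that a zone nested inside another zone of the same tag does not end the scan early. If your zones cannot nest, a plain true/false flag would do.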
I am not involved with the development of HTML::Parser, but I do use the module extensively.
Mojotoad