Okay, there have been references to HTML::Parser here, but they have all avoided the "easy" solution that it offers. Most fall into traps and do not fully address your original question. As I understand it, that question involves extracting specific text segments from a certain zone in an HTML document, ignoring any HTML tags embedded within that zone, and ceasing extraction once the parser has moved past the zone.

HTML::Parser has recently evolved (version 3 introduced a handler-based syntax), and I invite you to check out the new syntax, but for the time being I will stick to the "old" subclass-and-callback syntax, which the current release still supports.

Use the start() and end() callback methods to keep track of your context. Use the text() method to slurp text segments into a cache while you are inside the zone. When end() sees the tag that closes the zone, it clears the in-zone condition and dumps the cache.

This both keeps track of your context and neatly separates your text from any tags embedded in the zone. Regexps be damned -- they only need to be applied to the cached result, not to the HTML.
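
Here is a minimal sketch of that approach, using the "old" subclass style. The package name ZoneParser, the choice of <div id="target"> as the zone, and the hash keys (depth, zone_tag, cache, zones) are all my own assumptions for illustration; adapt the test in start() to whatever marks your zone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subclass: the name and the zone test below are assumptions.
package ZoneParser;
use base 'HTML::Parser';

# Called for every opening tag.
sub start {
    my ($self, $tag, $attr) = @_;
    if ($self->{depth}) {
        # Already inside the zone: count nested tags of the same name
        # so that we close the zone on the matching end tag.
        $self->{depth}++ if $tag eq $self->{zone_tag};
    }
    elsif ($tag eq 'div' && ($attr->{id} || '') eq 'target') {
        # Entering the zone: remember the tag and start a fresh cache.
        $self->{zone_tag} = $tag;
        $self->{depth}    = 1;
        $self->{cache}    = '';
    }
}

# Called for every closing tag.
sub end {
    my ($self, $tag) = @_;
    return unless $self->{depth} && $tag eq $self->{zone_tag};
    # Leaving the zone: dump the cache and stop collecting.
    push @{ $self->{zones} }, $self->{cache} if --$self->{depth} == 0;
}

# Called for every run of text; cache it only while inside the zone.
sub text {
    my ($self, $text) = @_;
    $self->{cache} .= $text if $self->{depth};
}

package main;

my $html = '<p>before</p>'
         . '<div id="target">Hello <b>bold</b> and <div>nested</div> world</div>'
         . '<p>after</p>';

my $p = ZoneParser->new;    # no arguments, so the old callback methods are used
$p->parse($html);
$p->eof;

for my $zone (@{ $p->{zones} || [] }) {
    (my $clean = $zone) =~ s/\s+/ /g;    # the regexp touches the cache only
    print "$clean\n";
}
```

With the sample input above this should print "Hello bold and nested world": the embedded tags are gone and nothing outside the zone is collected. As far as I know, text arrives undecoded in this callback style, so entities such as &amp;amp; would still need handling in the cached result.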

I am not involved with the development of HTML::Parser, but I do use the module extensively.

Mojotoad


In reply to Re: Search and replace everything except html tags by mojotoad
in thread Search and replace everything except html tags by thatguy
