in reply to Stripping HTML tags

EDIT: The following has been modified to take care of nested tags and DOCTYPE declarations. It should work fairly well now. However, as has been pointed out to me via PM, I probably shouldn't be suggesting regex solutions for a job that modules have already been designed for. You can shoot me now.

--------------

The following may do what you want:

# Slurp the whole document so tags that span lines can match.
$_ = join '', <DATA>;
# Repeat until nothing matches, so tags nested inside comments go too.
while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
s/ +/ /g;        # collapse runs of spaces
s/^ | $//mg;     # trim the leading and trailing space on each line
print;

__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
Once upon a time there was a <a href="page.html">link</a>
and some <b>bold text</b> and a paragraph break<p>
<!-- invisible <nested tag> -->
and a
<table cellspacing="0" cellpadding="0" border="0"><tr>
<td>table</td>
</tr></table>
4 < 5 > 3

This doesn't convert entities like &nbsp;, of course, but you can add code for that on your own fairly easily. Note that the above mostly preserves the page's line structure. You may want something more like the following if you're just trying to export the text:

$_ = join '', <DATA>;
while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
s/\s+/ /g;       # collapse all whitespace, newlines included
s/^ | $//g;      # trim both ends (/g so the trailing space is removed too)
print;
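
For the entity conversion mentioned above, a minimal sketch might look like the following. The %entity table and the decode_basic_entities name are assumptions made for illustration, not part of the snippet above. The table covers only a handful of named entities plus numeric references, and it maps &nbsp; to a plain space; the HTML::Entities module on CPAN handles the full set.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny entity table (an assumption for this sketch).
# &nbsp; is mapped to an ordinary space, which suits plain-text export.
my %entity = (
    nbsp => ' ',
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
);

# Decode numeric references and the few named entities listed above.
# Everything happens in a single pass so that decoded text is never
# decoded a second time; unknown named entities are left untouched.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9a-fA-F]+)|#([0-9]+)|(\w+));}{
          defined($1)         ? chr(hex($1))
        : defined($2)         ? chr($2)
        : exists $entity{$3}  ? $entity{$3}
        :                       "&$3;"
    }ge;
    return $text;
}

print decode_basic_entities('4&nbsp;&lt;&nbsp;5 &amp; &#65;&#x42; &bogus;'), "\n";
```

Run this after the tags have been stripped; otherwise a decoded &lt; could be mistaken for the start of a tag.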

Re^2: Stripping HTML tags
by fishbot_v2 (Chaplain) on May 24, 2005 at 20:19 UTC

    This is a nice simple solution, but it really depends on how robust the OP needs the solution to be. If they simply have a bunch of files they want to strip and then hand edit, this is excellent. If it needs to work unsupervised, then HTML::Strip or HTML::Parser might be a bit better.

    Two issues immediately come to mind: tags nested inside comments wouldn't strip correctly, and script or style elements would be poorly handled, since their contents would be left in the output. Also, DOCTYPE declarations are missed. I'd suggest:

    s/<!-- .*? -->//xsg;
    s/<(script|style)[^>]*> .*? <\/\1[^>]*>//xsg;
    s/(?: <[^<>]*> )+/ /xsg;
    # ...

    Still not terribly robust, but possibly sufficient.
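
    To see what the three passes suggested above do on a small page, here is a rough sketch. The strip_markup wrapper, the two whitespace clean-up lines, and the sample input are assumptions added for illustration; the three tag-stripping substitutions are used as given.

```perl
use strict;
use warnings;

# Hypothetical wrapper (the name is an assumption) around the three passes.
sub strip_markup {
    my ($html) = @_;
    $html =~ s/<!-- .*? -->//xsg;                           # comments, with any tags inside them
    $html =~ s/<(script|style)[^>]*> .*? <\/\1[^>]*>//xsg;  # script/style elements and their contents
    $html =~ s/(?: <[^<>]*> )+/ /xsg;                       # remaining tags, DOCTYPE included
    $html =~ s/\s+/ /g;                                     # added here: collapse whitespace
    $html =~ s/^ | $//g;                                    # added here: trim both ends
    return $html;
}

# Sample input invented for this sketch.
my $html = <<'END_HTML';
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
<style type="text/css">b { color: red }</style>
Some <b>bold text</b>
<!-- invisible <nested tag> -->
<script type="text/javascript">if (a < b) alert("hi");</script>
and a link: <a href="page.html">here</a>
END_HTML

print strip_markup($html), "\n";
```

    One difference from the pattern in the parent post is worth knowing: the last pass accepts anything between angle brackets, so the "< 5 >" in "4 < 5 > 3" is treated as a tag and removed.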

      Allow me to translate your answers. This solution is broken, but you have low standards for the quality of code, and don't mind giving people broken solutions without informing them of how they are broken.
        1. I didn't provide the solution; I commented on it.
        2. I don't have low standards for code. I suggested a different solution space (HTML::Strip or HTML::Parser) if this wasn't a one-off. Not all one-offs need to be robust.
        3. I gave three examples of places where the solution was "broken": script and style tags, tags nested in comments (illegal but common), and DOCTYPE declarations.
        4. I made suggestions for fixing those three specific issues.

        Your point is taken, though: I didn't say why I still don't think the amended solution is robust. Parsing HTML with a series of regexes is slow and difficult. Script tags don't necessarily have end tags, for example: they could simply link to a .js file. Then, if a closing script tag for another block appeared much later in the HTML document, the match would run from the first tag to that closing tag and delete all the valid content in between.
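
        The failure mode just described can be reproduced with a few lines. This is only a sketch: the sub name and the input are made up for illustration, and the first script tag is assumed to be written in self-closing form.

```perl
use strict;
use warnings;

# The script/style pass from the suggestion above, wrapped in a sub whose
# name is hypothetical.
sub strip_script_and_style {
    my ($html) = @_;
    $html =~ s/<(script|style)[^>]*> .*? <\/\1[^>]*>//xsg;
    return $html;
}

# Assumed input: the first script tag only links to a .js file and has no
# end tag; an ordinary script block follows later in the page.
my $html = join '',
    '<script src="menu.js"/>',
    '<p>Important paragraph.</p>',
    '<script>var x = 1;</script>',
    '<p>Footer.</p>';

print strip_script_and_style($html), "\n";
```

        Only the footer paragraph survives: the match starts at the first script tag and runs to the only closing script tag it can find, taking the important paragraph with it.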

        For performance, HTML::Strip is an XS module, so it should be much faster than the multi-pass regex approach.