Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Help!!!

Does anyone know how to strip all the HTML code from an HTML page? I need to store the data between the tags in another format, and the HTML code is my problem. Any help?

Thank you all!

Replies are listed 'Best First'.
Re: Stripping HTML tags
by thundergnat (Deacon) on May 24, 2005 at 15:47 UTC

    It seems way too obvious, but would HTML::Strip be what you are looking for?
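For reference, a minimal sketch of what using HTML::Strip might look like, assuming the module is installed from CPAN (it is not part of core Perl). The sample string and the variable names are made up for illustration:

```perl
use strict;
use warnings;
use HTML::Strip;    # CPAN module, assumed to be installed

# Hypothetical input, just for illustration.
my $raw_html = '<p>Some <b>bold text</b> and a <a href="page.html">link</a></p>';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($raw_html);    # returns the text with the markup removed
$hs->eof;                            # reset the parser before reusing the object

print "$text\n";
```

Exact whitespace in the result depends on the module's settings, so check the output against your own pages before relying on it.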

Re: Stripping HTML tags
by blazar (Canon) on May 24, 2005 at 15:56 UTC

    Well, maybe the FAQ knows... see e.g. perldoc -q HTML. Well, it may well be that the answers in the FAQ are not suitable for your needs, but you always check it before posting here, don't you? ;-)

Re: Stripping HTML tags
by TedPride (Priest) on May 24, 2005 at 17:10 UTC

    EDIT: The following has been modified to take care of nested tags and DOCTYPE declarations. It should work fairly well now. However, as has been pointed out to me via PM, I probably shouldn't be suggesting regex solutions for a job that modules have already been designed for. You can shoot me now.

    --------------

    The following may do what you want:

    $_ = join '', <DATA>;
    while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
    s/ +/ /g;
    s/^ | $//mg;
    print;
    __DATA__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
    Once upon a time there was a <a href="page.html">link</a>
    and some <b>bold text</b> and a paragraph break<p>
    <!-- invisible <nested tag> -->
    and a <table cellspacing="0" cellpadding="0" border="0"><tr>
    <td>table</td>
    </tr></table>
    4 < 5 > 3

    This doesn't convert things like &nbsp;, of course, but you can add code for that on your own fairly easily. Note that the above mostly preserves page structure - you may want something more like the following if you're just trying to export the text:

    $_ = join '', <DATA>;
    while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
    s/\s+/ /g;
    s/^ | $//g;
    print;
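
As for the &nbsp; remark above, one possible way to handle entities after the tags are gone is sketched below. This is only an illustration: the helper name decode_basic_entities and the five-entry table are assumptions, not a complete entity list, and a dedicated module would cover far more cases.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny table of named entities.
my %entity = (
    nbsp => ' ',
    lt   => '<',
    gt   => '>',
    quot => '"',
    amp  => '&',
);

# Hypothetical helper: decode numeric entities and the few named ones above.
sub decode_basic_entities {
    my ($text) = @_;
    # A single pass over the string, so "&amp;lt;" becomes "&lt;"
    # and is not decoded a second time.
    $text =~ s{&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));}{
          defined $1          ? chr($1)
        : defined $2          ? chr(hex $2)
        : exists $entity{$3}  ? $entity{$3}
        :                       $&          # unknown entity: leave untouched
    }ge;
    return $text;
}

print decode_basic_entities('4 &lt; 5 &amp; 5 &gt; 3'), "\n";
```

Run it after the tag-stripping substitutions rather than before; otherwise a decoded &lt; could be mistaken for the start of a tag.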

      This is a nice simple solution, but it really depends on how robust the OP needs the solution to be. If they simply have a bunch of files they want to strip and then hand edit, this is excellent. If it needs to work unsupervised, then HTML::Strip or HTML::Parser might be a bit better.

      Two issues that immediately come to mind: tags nested inside comments wouldn't strip correctly, and things like script or style tags would be poorly handled (their contents would be left in the output). Also, DOCTYPE declarations are missed. Suggest:

      s/<!-- .*? -->//xsg;
      s/<(script|style)[^>]*> .*? <\/\1[^>]*>//xsg;
      s/(?: <[^<>]*> )+/ /xsg;
      # ...

      Still not terribly robust, but possibly sufficient.
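
To see the three substitutions working together, here is a small self-contained sketch. The function name strip_markup and the sample string are made up for illustration, and the two whitespace clean-up lines at the end are an added assumption rather than part of the suggestion above. Like the original patterns, it matches script and style in lower case only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the three suggested substitutions.
sub strip_markup {
    my ($html) = @_;
    $html =~ s/<!-- .*? -->//xsg;                            # drop comments first
    $html =~ s/<(script|style)[^>]*> .*? <\/\1[^>]*>//xsg;   # drop script/style blocks and their contents
    $html =~ s/(?: <[^<>]*> )+/ /xsg;                        # replace each run of tags with one space
    $html =~ s/\s+/ /g;                                      # assumption: collapse whitespace
    $html =~ s/^ | $//g;                                     # assumption: trim both ends
    return $html;
}

# Hypothetical sample input.
my $sample = '<p>Hello <b>world</b><!-- hidden <i>tag</i> -->'
           . '<script type="text/javascript">var x = "<b>";</script> again</p>';

print strip_markup($sample), "\n";    # prints "Hello world again"
```

Note that the last tag pattern also treats a bare "< 5 >" in running text as a tag, so this is still a sketch for supervised use, not a replacement for a real parser.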

        Allow me to translate your answers. This solution is broken, but you have low standards for the quality of code, and don't mind giving people broken solutions without informing them of how they are broken.