Stripping HTML tags

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Stripping HTML tags by thundergnat (Deacon) on May 24, 2005 at 15:47 UTC
It seems way too obvious, but would HTML::Strip be what you are looking for?
Re: Stripping HTML tags by blazar (Canon) on May 24, 2005 at 15:56 UTC
Well, maybe the FAQ knows... see e.g. `perldoc -q HTML`. Well, it may well be that the answers in the FAQ are not suitable for your needs, but you always check it before posting here, don't you? ;-)
Re: Stripping HTML tags by TedPride (Priest) on May 24, 2005 at 17:10 UTC
EDIT: The following has been modified to take care of nested tags and DOCTYPE declarations. It should work fairly well now. However, as has been pointed out to me via PM, I probably shouldn't be suggesting regex solutions for a job that modules have already been designed for. You can shoot me now.

--------------

The following may do what you want:

    $_ = join '', <DATA>;
    # Loop until nothing matches, so a tag nested inside a comment is removed first
    while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
    s/ +/ /g;       # collapse runs of spaces
    s/^ | $//mg;    # trim a leading or trailing space on each line
    print;
    __DATA__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
    Once upon a time there was a <a href="page.html">link</a>
    and some <b>bold text</b> and a paragraph break<p>
    <!-- invisible <nested tag> -->
    and a <table cellspacing="0" cellpadding="0" border="0"><tr>
    <td>table</td>
    </tr></table>
    4 < 5 > 3

This doesn't convert entities like `&nbsp;`, of course, but you can add code for that on your own fairly easily. Note that the above mostly preserves page structure. You may want something more like the following if you're just trying to export the text:

    $_ = join '', <DATA>;
    while (s/<(?:\/?\w|!)[^<>]*>/ /sg) {}
    s/\s+/ /g;      # collapse all whitespace, including newlines
    s/^ | $//g;     # trim both ends
    print;
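The entity conversion left as an exercise above could look something like the sketch below. It is only an illustration: the helper name `decode_basic_entities` and the small lookup table are assumptions, not part of the original post, and `&nbsp;` is mapped to a plain space on the assumption that the goal is text export. For anything beyond a handful of entities, a module such as HTML::Entities is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical table: only a few common named entities are covered.
my %ENTITY = (
    nbsp => ' ',    # assumption: a plain space is good enough for text export
    lt   => '<',
    gt   => '>',
    quot => '"',
    amp  => '&',
);

# Hypothetical helper: decode numeric and a few named entities in one pass.
# A single pass avoids double decoding, e.g. "&amp;lt;" becomes "&lt;", not "<".
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s{
        &(?: \#x([0-9a-fA-F]+)    # hex reference such as &#x41;
           | \#([0-9]+)           # decimal reference such as &#65;
           | (\w+)                # named reference such as &amp;
        );
    }{
        defined $1 ? chr(hex($1))
      : defined $2 ? chr($2)
      : exists $ENTITY{$3} ? $ENTITY{$3}
      : "&$3;"                    # unknown names are left untouched
    }gex;
    return $text;
}

print decode_basic_entities("Fish&nbsp;&amp;&nbsp;chips &lt;b&gt; &#65;&#x42;"), "\n";
# prints: Fish & chips <b> AB
```

Run this after the tags have been stripped, not before; otherwise a decoded `&lt;b&gt;` would be mistaken for a real tag.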
Re^2: Stripping HTML tags by fishbot_v2 (Chaplain) on May 24, 2005 at 20:19 UTC
This is a nice simple solution, but it really depends on how robust the OP needs the solution to be. If they simply have a bunch of files they want to strip and then hand edit, this is excellent. If it needs to work unsupervised, then HTML::Strip or HTML::Parser might be a bit better. Two issues immediately come to mind: tags nested inside comments wouldn't strip correctly, and things like script or style tags would be poorly handled. Also, DOCTYPE declarations are missed. Suggest:

    s/<!-- .*? -->//xsg;                              # comments, with anything inside them
    s/<(script|style)[^>]*> .*? <\/\1[^>]*>//xsg;     # script and style elements, contents included
    s/(?: <[^<>]*> )+/ /xsg;                          # any remaining run of tags becomes one space
    # ...

Still not terribly robust, but possibly sufficient.
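For comparison, here is one way the substitutions discussed in this thread might be combined into a single routine. This is a sketch under assumptions, not either poster's code: the name `strip_tags` and the sample input are made up for illustration, and it is still a regex approach, so it will trip over things like a `>` inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper combining the substitutions discussed in this thread.
sub strip_tags {
    my ($html) = @_;
    # Drop comments whole, so tags inside them never reach the later passes
    $html =~ s/<!--.*?-->//sg;
    # Drop script and style elements together with their contents
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//sig;
    # Replace remaining tags and declarations with a space;
    # a bare "<" followed by a space (as in "4 < 5") is left alone
    $html =~ s/<(?:\/?\w|!)[^<>]*>/ /sg;
    # Collapse all whitespace, then trim both ends
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

my $sample = <<'HTML';
<!DOCTYPE html>
<style>b { color: red }</style>
<p>Some <b>bold</b> text<!-- hidden <i>tag</i> --> and 4 < 5 > 3</p>
<script>if (a < b) { alert("x") }</script>
HTML

print strip_tags($sample), "\n";
# prints: Some bold text and 4 < 5 > 3
```

The order matters: comments and script/style blocks go first so that their contents are discarded rather than left behind as text once the surrounding tags are removed.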
Re^3: Stripping HTML tags by tilly (Archbishop) on May 24, 2005 at 23:49 UTC
Allow me to translate your answers. This solution is broken, but you have low standards for the quality of code, and don't mind giving people broken solutions without informing them of how they are broken.
Re^4: Stripping HTML tags by fishbot_v2 (Chaplain) on May 25, 2005 at 00:55 UTC
Re^5: Stripping HTML tags by tilly (Archbishop) on May 25, 2005 at 01:01 UTC