Stripping HTML tags with Regular Expressions.

Brian has asked for the wisdom of the Perl Monks concerning the following question:

Re: Stripping HTML tags with Regular Expressions. by jeroenes (Priest) on Oct 04, 2001 at 19:18 UTC
A few comments:
- Regexes and HTML don't like each other.
- CPAN is our valued friend. Make it your friend too; it will save you lots of time. E.g., see HTML::Parser.
- See also 7 Stages of Regex Users.
- Please don't use this: `s/<[^>]*>//g`, as it is a very bad idea. See the previous point.
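To make the warning concrete, here is a minimal sketch of how the substitution quoted above misbehaves on ordinary HTML. The helper name `naive_strip` and the sample strings are made up for illustration; only the regex itself comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the naive substitution from the post:
# delete everything from a '<' up to the next '>'.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# Simple markup comes out fine.
print naive_strip('<b>bold</b> text'), "\n";

# A '>' inside a quoted attribute value ends the "tag" too early,
# so the rest of the attribute leaks into the output.
print naive_strip('<img src="a.png" alt="x > y">after'), "\n";

# The same happens with a '>' inside a comment.
print naive_strip('<!-- if a > b -->visible'), "\n";
```

The first call yields clean text, while the other two leave fragments of markup behind, which is the kind of case a real parser handles and a one-line regex does not.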
Re: Stripping HTML tags with Regular Expressions. by mirod (Canon) on Oct 04, 2001 at 19:29 UTC
You are guessing wrong! It is a FAQ, and `perldoc -q html` would have given you answers right away, along with plenty of examples of non-trivial HTML that breaks "naive" regexps. A search on `strip tags` on this site also returns a whole bunch of answers. Some of them even make sense: Re: Strip HTML tags uses HTML::TreeBuilder, and another one would have led you to Using HTML::Parser - a quick guide, which uses... HTML::Parser. This would have been faster than asking here, BTW...
Re: Stripping HTML tags with Regular Expressions. by petdance (Parson) on Oct 04, 2001 at 19:20 UTC
Hey, we always "need it fast". Nobody programs leisurely. I'd suggest you use that little search box at the top of the screen and search for, say, "strip html tags". Truly, the fastest answer is one that already exists. xoxo, Andy -- <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>
Re: Stripping HTML tags with Regular Expressions. by dlc (Acolyte) on Oct 04, 2001 at 20:22 UTC
Take a look at Tom Christiansen's TPJ article, "HTML Hacking with Regular Expressions" (darren)
Re: Stripping HTML tags with Regular Expressions. by steves (Curate) on Oct 04, 2001 at 20:15 UTC
A regexp seems the way to go at first, until a day later you're still handling all the complex cases you forgot about. Better to use something like HTML::Parser that does it right. HTML::Parser used to have an example that stripped all tags. I do use regexps for cases that are predictable -- e.g., we have some text that's known to contain only specific markup that's easily handled with regexps. Not for general HTML tag stripping, though.
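For reference, a tag stripper built on HTML::Parser might look roughly like the sketch below. This is not the example that shipped with the module; the function name `strip_tags` and the sample input are invented here, and it assumes HTML::Parser (version 3 API) is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return only the text content of an HTML string.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Collect every text chunk; 'dtext' hands over entity-decoded text.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Skip the contents of script and style elements entirely.
    $p->ignore_elements(qw(script style));

    $p->parse($html);
    $p->eof;    # flush any text the parser is still buffering

    return $text;
}

# The quoted '>' inside the alt attribute is parsed as part of the tag.
print strip_tags('<p>Hello <img src="a.png" alt="x > y"><b>world</b></p>'), "\n";
```

Since no start, end, or comment handlers are registered, the parser simply discards those events and only text reaches the output.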