Brian has asked for the wisdom of the Perl Monks concerning the following question:

Hello to all in Robes,
I need to strip all the HTML tags from a flat text input. I am guessing that the best way to do this is with regular expressions. I have a rudimentary understanding of regexes, but to be honest I need this fast and haven't really got the time to work it out for myself.
I know there are those out there who just love regexes, so hopefully someone has the answer to my prayers.
Thanks in anticipation,

Brian (acolyte of the lowest kind)

Re: Stripping HTML tags with Regular Expressions.
by jeroenes (Priest) on Oct 04, 2001 at 19:18 UTC
    A few comments:

    1. Regexes and HTML don't like each other
    2. CPAN is our valued friend. Make it your friend too; it will save you lots of time.
    3. For example, see HTML::Parser
    4. see also 7 Stages of Regex Users
    5. Please don't use this: s/<[^>]*>//g as it is a very bad idea. See previous point.
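
    To make point 5 concrete, here is a minimal sketch of how that substitution goes wrong. The helper name strip_naive and the two sample inputs are made up for illustration; only the regex comes from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the naive substitution from point 5.
sub strip_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# A '>' inside a quoted attribute value ends the match too early,
# so part of the tag leaks into the output.
print strip_naive('<img src="a.png" alt="x > y">done'), "\n";   # prints ' y">done'

# A '>' inside a comment cuts the match short in the same way.
print strip_naive('<!-- if a > b -->visible'), "\n";            # prints ' b -->visible'
```

    Both inputs are legal HTML, and in both cases markup survives the "strip". A real parser tracks quoting and comment state, which a single character class cannot do.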
Re: Stripping HTML tags with Regular Expressions.
by mirod (Canon) on Oct 04, 2001 at 19:29 UTC

    You are guessing wrong! This is a FAQ: perldoc -q html would have given you answers right away, along with plenty of examples of non-trivial HTML that breaks "naive" regexps.

    A search for "strip tags" on this site also returns a whole bunch of answers. Some of them even make sense: Re: Strip HTML tags uses HTML::TreeBuilder, and another one would have led you to Using HTML::Parser - a quick guide, which uses... HTML::Parser.

    This would have been faster than asking here, BTW...

Re: Stripping HTML tags with Regular Expressions.
by petdance (Parson) on Oct 04, 2001 at 19:20 UTC
    Hey, we always "need it fast". Nobody programs leisurely.

    I'd suggest you use that little search box at the top of the screen and search for, say, "strip html tags".

    Truly, the fastest answer is one that already exists.

    xoxo,
    Andy
    --
    <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>

Re: Stripping HTML tags with Regular Expressions.
by dlc (Acolyte) on Oct 04, 2001 at 20:22 UTC
Re: Stripping HTML tags with Regular Expressions.
by steves (Curate) on Oct 04, 2001 at 20:15 UTC

    A regexp seems the way to go at first, until a day later you're still handling all the complex cases you forgot about.

    Better to use something like HTML::Parser that does it right. HTML::Parser used to have an example that stripped all tags.
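
    For reference, a tag stripper along those lines might look roughly like the sketch below. This is not the example that shipped with HTML::Parser, just a minimal version assuming the module is installed from CPAN and using its version 3 handler API; the helper name strip_tags and the sample input are hypothetical.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: keep only the text events and drop everything else.
sub strip_tags {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the callback text with entities such as &amp; decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed
    return $text;
}

print strip_tags('<p>Fish &amp; <b>chips</b></p>'), "\n";
```

    Because no handlers are registered for tags or comments, those are simply dropped. Note that the contents of script and style elements are still reported as text, so a real stripper would probably also want to call $parser->ignore_elements(qw(script style)) before parsing.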

    I do use regexps for cases that are predictable, e.g., when we have some text that's known to contain only specific markup that's easily handled with regexps. Not for general HTML tag stripping though.
