Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am removing HTML tags by hand . . . though it's 90% a set of really neat regexes that I found online (and yes, it does work with comments, etc.). The only problem is that it leaves behind the non-visible text between the opening and closing tags of certain elements (such as STYLE, OPTION, SCRIPT, and TEXTAREA). I made the following regex:

$textonly =~ s#<(script|style|option|textarea)[^>]*>[^(</\1)]*</\1[^>]*># #gi;

But it doesn't work; I think it's because a script can contain a < before it's really done. The problem is in the middle: [^(</\1)]* . . . I'm sure you know what I meant. But how do I do this the right way in a regex? (That is, something to the extent of "continue until you see a </\1".) Thanks!

Re: My Regex Won't Work . . .
by Roger (Parson) on Mar 03, 2004 at 03:13 UTC
    You should have a look at the HTML::Strip module, it does a good job at stripping HTML tags.

    As for your immediate problem...
    my $str = do { local $/; <DATA> };
    $str =~ s!<(script|style|option|textarea)[^>]*?>.*?</\1[^>]*?>!!gsi;

    # In fact when you turn on the non-greedy match with '?', you
    # don't need the [^>] either... So the following regex works
    # equally well:
    # $str =~ s!<(script|style|option|textarea).*?>.*?</\1.*?>!!gsi;

    print "$str\n";

    __DATA__
    <script>
    This is removed.
    </script>
    This is not removed.
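
    For anyone wondering why the original attempt fails: [^(</\1)] is a character class, so it means "any one character that is not (, <, /, ) or \001" (inside brackets \1 is an octal escape, not a backreference). It cannot express "anything other than the sequence </\1". Besides the non-greedy .*? above, another way to say "continue until you see a </\1" is a negative lookahead checked before each character. The following is only a minimal sketch: the sub names original_attempt and strip_hidden_blocks and the sample strings are made up for illustration, and the \b and \s* are small additions that are not in the regexes above.

```perl
use strict;
use warnings;

# The regex from the question, unchanged. The bracketed class stops at the
# first '(', '<', '/' or ')', so most real scripts are never matched.
sub original_attempt {
    my ($html) = @_;
    $html =~ s#<(script|style|option|textarea)[^>]*>[^(</\1)]*</\1[^>]*># #gi;
    return $html;
}

# Hypothetical helper: (?:(?!</\1\b).)* consumes one character at a time,
# but only while the closing tag for the captured name does not start here.
# /s lets '.' cross newlines; /i makes the backreference case-insensitive.
sub strip_hidden_blocks {
    my ($html) = @_;
    $html =~ s{<(script|style|option|textarea)\b[^>]*>(?:(?!</\1\b).)*</\1\s*>}{ }gis;
    return $html;
}

my $sample    = '<p>Keep</p><script>if (a < b) { run(); }</script><b>this</b>';
my $multiline = qq{a<STYLE type="text/css">\np < q\n</style >b};

print original_attempt($sample), "\n";       # left unchanged
print strip_hidden_blocks($sample), "\n";    # <p>Keep</p> <b>this</b>
print strip_hidden_blocks($multiline), "\n"; # a b
```

    On inputs like these the lookahead version and the .*? version remove the same text; the lookahead just spells out the "until you see </\1" condition. Neither is a real HTML parser, so something like a literal </script> inside a JavaScript string will still end the match early.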

Re: My Regex Won't Work . . .
by Anonymous Monk on Mar 03, 2004 at 03:20 UTC
    Works great. Thanks.

    For the record, I'm doing this by hand because: 1) Nothing did exactly what I wanted . . . I'm doing a little more than extracting text. 2) I found a series of really big regexes that do the same thing, and it was easy to add. 3) The script is going to be embedded into C++ on a machine that may or may not have Perl, so modules are a pain.