kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I need a regex to remove "bad" tags and keep the good ones from a string.

Here are some examples of what are considered good tags (assuming each of them is an input string):

[b]bold text[/b]
[color=Red]Red text text[/color]
[color=Red][b]Red bold text[/b][/color]

I need to remove the ones below:

[b][/b]
[color=Red][/color]
[color=Red][b][/b][/color]

That is, without any text in between the tags.

Is there an easy way to do that?

Thanks in advance :)

Replies are listed 'Best First'.
Re: Regex help
by tachyon (Chancellor) on Jul 31, 2004 at 12:17 UTC

    Regexes have limitations with stream parsing, but something like this will probably do the trick. You need the 1 while because, in the last example, the first pass removes the inner empty tag pair and the second pass removes the enclosing pair, which is itself empty once the inner pair has been replaced by an empty string. I have taken "nothing between" literally. I have also assumed that the general syntax is [blah...]...[/blah].
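    To make the two passes visible, here is a minimal sketch that runs the same kind of substitution in an explicit while loop instead of 1 while. The variable names $text and $pass are only for illustration.

```perl
use strict;
use warnings;

my $text = "[color=Red][b][/b][/color]";
my $pass = 0;

# Each iteration removes every empty pair that exists at that moment.
# The loop stops as soon as a pass finds nothing left to remove.
while ($text =~ s#\[(\w+)[^\]]*\]\[/\1\]##ig) {
    $pass++;
    print "after pass $pass: '$text'\n";
}
```

    This prints '[color=Red][/color]' after the first pass and an empty string after the second; a single s///g would stop after the first.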

    The capture into $1 (\1) assumes there is no space between the [ and the blah.... If the input is not well formed (i.e. written by humans rather than generated by a machine), it is generally useful to put lots of \s* tokens into the RE so that it can deal with [ blah=foo]    [ /  blah]

    local $/;
    $_ = <DATA>;
    1 while s#\[(\w+)[^\]]*\]\[/\1\]##ig;
    print;
    __DATA__
    [b]bold text[/b]
    [color=Red]Red text text[/color]
    [color=Red][b]Red bold text[/b][/color]
    I need to remove the ones below:
    [b][/b]
    [color=Red][/color]
    [color=Red][b][/b][/color]
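    The whitespace-tolerant variant mentioned above might look something like the following. This is only a sketch: the helper name strip_empty_tags is made up, and it assumes that a pair separated by nothing but whitespace should also count as empty.

```perl
use strict;
use warnings;

# Hypothetical helper: remove empty tag pairs, tolerating stray whitespace
# inside the brackets and between the opening and closing tag.
sub strip_empty_tags {
    my ($text) = @_;
    1 while $text =~ s{
        \[ \s* (\w+) [^\]]* \]   # opening tag, e.g. [ color=Red ]
        \s*                      # only whitespace between the two tags
        \[ \s* / \s* \1 \s* \]   # closing tag with the same name
    }{}gix;
    return $text;
}

print "'", strip_empty_tags("[ blah=foo]    [ /  blah]"), "'\n";
print "'", strip_empty_tags("[b]bold text[/b]"), "'\n";
```

    The /x flag is there only so the pattern can be spread over several commented lines; the matching logic is otherwise the same as the one-liner.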

    cheers

    tachyon

      You saved my day, tachyon! Lots of thanks!
Re: Regex help
by Dietz (Curate) on Jul 31, 2004 at 12:27 UTC
    The following regex should do the trick:

    .+(?<=\])(.+?)(?=\[\/).+

    Sample code:
    #!/usr/bin/perl -w
    use strict;

    my $bold     = "[b]bold text[/b]";
    my $red      = "[color=Red]Red text text[/color]";
    my $red_bold = "[color=Red][b]Red bold text[/b][/color]";

    my $regex = qr/.+(?<=\])(.+?)(?=\[\/).+/;

    $bold     =~ s/$regex/$1/;
    $red      =~ s/$regex/$1/;
    $red_bold =~ s/$regex/$1/;

    print "\$bold: $bold\n";
    print "\$red: $red\n";
    print "\$red_bold: $red_bold\n";

    __END__

    __OUTPUT__
    $bold: bold text
    $red: Red text text
    $red_bold: Red bold text

      The 1 while repeated substitution trick is useful for this sort of problem. See my example above. I prefer a negated character class, i.e. [^\]] in this example, to a non-greedy .+? because it saves backtracking and is slightly more specific, which can improve accuracy. It also matches \n, for example, which . does not by default.
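      A small sketch of the \n point, using a made-up tag whose attribute is broken across two lines:

```perl
use strict;
use warnings;

my $tag = "[color=\nRed]";   # attribute value split over two lines

# . does not match "\n" unless /s is given, so .*? cannot reach the ]
my $dot     = $tag =~ /^\[\w+.*?\]$/    ? "match" : "no match";

# [^\]] matches any character except ], including "\n"
my $negated = $tag =~ /^\[\w+[^\]]*\]$/ ? "match" : "no match";

print "dot: $dot, negated class: $negated\n";
```

      The dot version reports no match and the negated class reports a match. Adding /s to the first pattern would make it match as well.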

      Lots of ways to skin the cat, provided we can make a nice tasty stew. TIMTOWTDI.

      cheers

      tachyon

        I'd guess that your regex is still going to do a fair amount of backtracking. I'd suggest (?>(\w+))[^\]]* or (\w+)(=[^\]]*)? for the opening tag instead (untested).

        Update: this isn't just a backtracking issue; tachyon's original regex will match things like [color=Red][/col].
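        A quick sketch comparing the original pattern with the two suggestions on that input. The hash keys are just labels for this example, and the "optional" form assumes attributes always start with =.

```perl
use strict;
use warnings;

my %patterns = (
    original => qr{\[(\w+)[^\]]*\]\[/\1\]}i,         # \w+ may give back characters
    optional => qr{\[(\w+)(?:=[^\]]*)?\]\[/\1\]}i,   # name must be followed by = or ]
    atomic   => qr{\[(?>(\w+))[^\]]*\]\[/\1\]}i,     # \w+ keeps everything it matched
);

for my $input ("[color=Red][/color]", "[color=Red][/col]") {
    for my $name (sort keys %patterns) {
        my $verdict = $input =~ $patterns{$name} ? "match" : "no match";
        print "$name on $input: $verdict\n";
    }
}
```

        With the original pattern, (\w+) can shrink to "col" and let [^\]]* absorb "or=Red", which is why [color=Red][/col] matches. Both alternatives reject that input while still matching [color=Red][/color].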

      Thanks, Dietz!

      I ran your code. It doesn't completely remove the following bad tags:

      my $empty = "[color=Red][b][/b][/color]";
        Sorry kiat, it seems I completely misunderstood the task.
        Here's another go, though tachyon's solution is excellent:

        #!/usr/bin/perl -w
        use strict;

        my $bold     = "[b]bold text[/b]";
        my $red      = "[color=Red]Red text text[/color]";
        my $red_bold = "[color=Red][b]Red bold text[/b][/color]";
        my $empty    = "[color=Red][b][/b][/color]";

        &check_tags($bold);
        &check_tags($red);
        &check_tags($red_bold);
        &check_tags($empty);

        sub check_tags {
            my $tag = shift;
            print $tag, $/ if $tag =~ /(?:\[[^\]]+\])+.+?(?<!\])(?:\[\/).+/;
        }

        __END__

        __OUTPUT__
        [b]bold text[/b]
        [color=Red]Red text text[/color]
        [color=Red][b]Red bold text[/b][/color]