Regex help

kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I need a regex to remove "bad" tags and keep the good ones from a string.

Here are some examples of what are considered good tags (assuming each of them is an input string):

[b]bold text[/b]
[color=Red]Red text text[/color]
[color=Red][b]Red bold text[/b][/color]

I need to remove the ones below:

[b][/b]
[color=Red][/color]
[color=Red][b][/b][/color]

That is, without any text in between the tags.

Is there an easy way to do that?

Thanks in advance :)

Replies are listed 'Best First'.
Re: Regex help by tachyon (Chancellor) on Jul 31, 2004 at 12:17 UTC
Regexes have limitations when it comes to stream parsing, but something like this will probably do the trick. You need the `1 while` because of cases like the last example: the first pass removes the inner empty tag pair, and the second pass removes the enclosing pair, which only becomes empty once the inner pair has been replaced by the empty string. I have taken "nothing between" literally. I have also assumed that the general syntax is `[blah...]...[/blah]`. The capture into $1 (\1) assumes there is no space between the [ and the blah. If the input is not well formed (i.e. it was typed by humans rather than generated by a machine), it is generally useful to put lots of \s* tokens into the RE so that it can deal with `[ blah=foo] [ / blah]`.

    local $/;
    $_ = <DATA>;
    1 while s#\[(\w+)[^\]]*\]\[/\1\]##ig;
    print;
    __DATA__
    [b]bold text[/b]
    [color=Red]Red text text[/color]
    [color=Red][b]Red bold text[/b][/color]
    I need to remove the ones below:
    [b][/b]
    [color=Red][/color]
    [color=Red][b][/b][/color]

cheers

tachyon
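To see why the `1 while` loop matters, here is a minimal sketch that counts how many substitution passes the nested example needs. The sub names `strip_empty_tags` and `strip_empty_tags_loose` are made up for illustration, and the whitespace-tolerant pattern is only one possible reading of the "lots of \s* tokens" advice, not something posted in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: delete empty [tag...][/tag] pairs until a pass
# finds nothing left to delete. Returns the text and the pass count.
sub strip_empty_tags {
    my ($text) = @_;
    my $passes = 0;
    # Same pattern as in the reply above. Each successful pass can expose
    # a new empty pair one nesting level further out.
    $passes++ while $text =~ s#\[(\w+)[^\]]*\]\[/\1\]##ig;
    return ($text, $passes);
}

# Assumed variant for hand-typed input: optional whitespace after '[',
# around '/', before the closing ']', and between the two tags. Note that
# this also treats a pair containing only whitespace as empty.
sub strip_empty_tags_loose {
    my ($text) = @_;
    1 while $text =~ s#\[\s*(\w+)[^\]]*\]\s*\[\s*/\s*\1\s*\]##ig;
    return $text;
}

my ($nested, $passes) = strip_empty_tags('[color=Red][b][/b][/color]');
print "nested empty pair: '$nested' after $passes passes\n";

my ($good) = strip_empty_tags('[color=Red][b]Red bold text[/b][/color]');
print "good input kept:   '$good'\n";

print "loose variant:     '", strip_empty_tags_loose('[ blah=foo] [ / blah]'), "'\n";
```

A single `s###g` without the loop would leave `[color=Red][/color]` behind for the nested case, because the outer pair only becomes empty after the inner one is gone.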
Re^2: Regex help by kiat (Vicar) on Jul 31, 2004 at 12:25 UTC
You saved my day, tachyon! Lots of thanks!
Re: Regex help by Dietz (Curate) on Jul 31, 2004 at 12:27 UTC
The following regex should do the trick: `.+(?<=\])(.+?)(?=\[\/).+`

Sample code:

    #!/usr/bin/perl -w
    use strict;

    my $bold     = "[b]bold text[/b]";
    my $red      = "[color=Red]Red text text[/color]";
    my $red_bold = "[color=Red][b]Red bold text[/b][/color]";

    my $regex = qr/.+(?<=\])(.+?)(?=\[\/).+/;

    $bold     =~ s/$regex/$1/;
    $red      =~ s/$regex/$1/;
    $red_bold =~ s/$regex/$1/;

    print "\$bold: $bold\n";
    print "\$red: $red\n";
    print "\$red_bold: $red_bold\n";

    __END__
    __OUTPUT__
    $bold: bold text
    $red: Red text text
    $red_bold: Red bold text
Re^2: Regex help by tachyon (Chancellor) on Jul 31, 2004 at 13:19 UTC
The `1 while` recursive substitution trick is useful for this sort of problem; see my example above. In this example I prefer a negated character class, i.e. `[^\]]`, to a non-greedy .+? because it saves backtracking and/or improves accuracy, as it is slightly more specific. It also matches \n, for example, which . does not by default. There are lots of ways to skin the cat, provided we can make a nice tasty stew: TIMTOWTDI.

cheers

tachyon
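Both claims about the negated character class can be checked with a short script. This is a minimal sketch: the helper `strip_once`, the sample strings, and the dot-based pattern (tachyon's pattern with `[^\]]*` swapped for `.*?`) are assumptions made for the comparison, not code from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tachyon's pattern, using the negated character class.
my $class_re = qr#\[(\w+)[^\]]*\]\[/\1\]#;
# Assumed alternative for comparison: same pattern with a non-greedy dot.
my $dot_re   = qr#\[(\w+).*?\]\[/\1\]#;

# Hypothetical helper: one global substitution pass with the given regex.
sub strip_once {
    my ($re, $text) = @_;
    $text =~ s/$re//g;
    return $text;
}

# Case 1: the attribute wraps across a line. '[^\]]' matches the newline,
# while '.' does not unless the /s modifier is used.
my $wrapped = "[color=\nRed][/color]";
print "class removes wrapped tag: ", (strip_once($class_re, $wrapped) eq '' ? "yes" : "no"), "\n";
print "dot removes wrapped tag:   ", (strip_once($dot_re,   $wrapped) eq '' ? "yes" : "no"), "\n";

# Case 2: a good pair followed by an empty pair. The dot is free to run
# across ']' characters, so it swallows the good pair as well.
my $mixed = '[b]keep me[/b][b][/b]';
print "class leaves: '", strip_once($class_re, $mixed), "'\n";
print "dot leaves:   '", strip_once($dot_re,   $mixed), "'\n";
```

In the second case the dot version deletes the whole string, including "keep me", which is the accuracy problem the more specific class avoids.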
Re^3: Regex help by ysth (Canon) on Aug 01, 2004 at 05:54 UTC
I'd guess that your regex is still going to do a fair amount of backtracking. I'd say `(?>(\w+))[^\]]*` or `(\w+)(=[^\]]*)?` (untested). Update: this isn't just a backtracking issue; tachyon's original regex will match things like `[color=Red][/col]`.
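The mismatched-tag problem can be reproduced with a few lines. In this sketch the two suggested fragments are embedded in tachyon's full pattern, with a `*` after each `[^\]]` so that attribute values of any length are accepted; that embedding, the helper `hits`, and the test inputs are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tachyon's original pattern: (\w+) may give back characters, so the
# captured name can shrink from "color" to "col".
my $original = qr#\[(\w+)[^\]]*\]\[/\1\]#i;

# Assumed full form of the first suggestion: the atomic group (?>...)
# stops the tag name from giving characters back once matched.
my $atomic = qr#\[(?>(\w+))[^\]]*\]\[/\1\]#i;

# Assumed full form of the second suggestion: the name must be followed
# directly by ']' or by an '=value' attribute.
my $attr = qr#\[(\w+)(=[^\]]*)?\]\[/\1\]#i;

# Hypothetical helper: 1 if the regex matches the string, else 0.
sub hits {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $input ('[color=Red][/col]', '[color=Red][/color]', '[b][/b]') {
    printf "%-22s original=%d atomic=%d attr=%d\n",
        $input, hits($original, $input), hits($atomic, $input), hits($attr, $input);
}
```

Only the original pattern accepts `[color=Red][/col]`; all three accept the two well-formed empty pairs.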
Re^4: Regex help by tachyon (Chancellor) on Aug 01, 2004 at 08:31 UTC
Re^2: Regex help by kiat (Vicar) on Jul 31, 2004 at 12:39 UTC
Thanks, Dietz! I ran your code. It doesn't completely remove the following bad tags: `my $empty = "[color=Red][b][/b][/color]";`
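For reference, the leftover can be reproduced with a short script. This is a sketch only: the regex is copied from Dietz's first reply, while the wrapper `apply_first_regex` is a made-up name for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Regex from Dietz's first reply: it captures what sits between the last
# usable ']' and a following '[/' and replaces the whole match with it.
my $regex = qr/.+(?<=\])(.+?)(?=\[\/).+/;

# Hypothetical wrapper around the substitution used in that reply.
sub apply_first_regex {
    my ($text) = @_;
    $text =~ s/$regex/$1/;
    return $text;
}

# The nested empty input: the capture ends up holding the inner closing
# tag, so a tag fragment survives instead of an empty string.
print apply_first_regex('[color=Red][b][/b][/color]'), "\n";

# A single empty pair does not match at all and comes back unchanged.
print apply_first_regex('[b][/b]'), "\n";

# Input with text between the tags is reduced to that text.
print apply_first_regex('[b]bold text[/b]'), "\n";
```

The first call leaves `[/b]` behind, since the regex extracts the content between tags rather than deleting empty pairs.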
Re^3: Regex help by Dietz (Curate) on Jul 31, 2004 at 14:35 UTC
Sorry kiat, it seems I completely misunderstood the task. Here's another go, though tachyon's solution is excellent:

    #!/usr/bin/perl -w
    use strict;

    my $bold     = "[b]bold text[/b]";
    my $red      = "[color=Red]Red text text[/color]";
    my $red_bold = "[color=Red][b]Red bold text[/b][/color]";
    my $empty    = "[color=Red][b][/b][/color]";

    &check_tags($bold);
    &check_tags($red);
    &check_tags($red_bold);
    &check_tags($empty);

    # Print the string only if there is text between the tags.
    sub check_tags {
        my $tag = shift;
        print $tag, $/ if $tag =~ /(?:\[[^\]]+\])+.+?(?<!\])(?:\[\/).+/;
    }

    __END__
    __OUTPUT__
    [b]bold text[/b]
    [color=Red]Red text text[/color]
    [color=Red][b]Red bold text[/b][/color]