Excluding groups of characters in regular expressions

semirhage has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to remove any nested paragraph tags from a huge file.

I have come up with the following regex so far...

s/<paragraph>(.*?)<paragraph>(.*?)<\/paragraph>(.*?)<\/paragraph>/<paragraph>$1$2$3<\/paragraph>/ig;

This appears to work in many cases; however, it also removes every other pair of properly formatted paragraph tags...

e.g.

If the input was the following:

<paragraph>some data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>

The regex would result in this being changed to:

<paragraph>some data</paragraph>more data<paragraph>even more data</paragraph>

After thinking about it, this makes sense, since I am trying to match four tag units within the text and the (.*?) doesn't exclude other paragraph tags from being included...
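To make that concrete, here is a small sketch that prints what each group captures for the sample input above. The helper name `captures` is made up for illustration; the pattern is the one from the question.

```perl
use strict;
use warnings;

# captures() is a hypothetical helper: it applies the pattern from the
# question once and returns the three captured groups.
sub captures {
    my ($text) = @_;
    return $text =~ m{<paragraph>(.*?)<paragraph>(.*?)</paragraph>(.*?)</paragraph>}i;
}

my $input = '<paragraph>some data</paragraph>'
          . '<paragraph>more data</paragraph>'
          . '<paragraph>even more data</paragraph>';

my @groups = captures($input);
printf "\$%d = [%s]\n", $_, $groups[$_ - 1] for 1 .. @groups;

# Prints:
# $1 = [some data</paragraph>]
# $2 = [more data]
# $3 = [<paragraph>even more data]
```

The lazy `(.*?)` groups are free to run across tag boundaries, so a single match swallows all three sibling paragraphs and the substitution drops the middle pair of tags.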

Is there any way to exclude <paragraph> or </paragraph> from the (.*?) matches?
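For reference, one common way to say "any text that contains no paragraph tag" is to put a negative lookahead in front of each character. The sketch below assumes the tags are balanced and removes one innermost nested pair per pass until none is left; `flatten_paragraphs` and `$no_tag` are names chosen here for illustration.

```perl
use strict;
use warnings;

# Any run of characters that contains neither <paragraph> nor </paragraph>.
# The lookahead is checked before every single character is consumed.
my $no_tag = qr{(?:(?!</?paragraph>).)*}is;

# flatten_paragraphs() is a hypothetical helper name.
sub flatten_paragraphs {
    my ($text) = @_;

    # Each pass finds an opening tag, tag-free text, and then a complete
    # inner pair whose content is also tag-free, and drops the inner tags.
    # Repeating until nothing matches handles deeper nesting.
    1 while $text =~ s{(<paragraph>$no_tag)<paragraph>($no_tag)</paragraph>}{$1$2}i;

    return $text;
}

# Sibling paragraphs are left alone:
print flatten_paragraphs(
    '<paragraph>some data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>'
), "\n";

# A nested pair is removed:
print flatten_paragraphs(
    '<paragraph>some <paragraph>some data</paragraph>data</paragraph>'
), "\n";
# Prints: <paragraph>some some datadata</paragraph>
```

Each pass rescans the string from the start, so on a very large file with a lot of nesting this may be slow; a single-pass depth counter or a real parser is likely the better fit there.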

Thanks...

Tom

Re: Excluding groups of characters in regular expressions by suaveant (Parson) on Dec 21, 2007 at 14:53 UTC
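You'd be better off treating this as a parser problem rather than a regexp problem, because trying to match nested items with a regexp is advanced stuff... here is some code that should get you started... not ideal, but easy :)

```perl
my $text = "<paragraph>some <paragraph>some data</paragraph>data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>";
my $depth = 0;

# Replace every opening or closing tag with whatever check() returns.
# $1 is the whole tag; $2 is "/" for a closing tag and undefined otherwise.
$text =~ s{(<(/)?paragraph>)}{check($depth)}gie;
print "$text\n";

sub check {
    # $_[0] is an alias for $depth, so the changes persist between calls.
    if ($2) {                           # closing tag
        $_[0]--;
        if ($_[0] == 0) { return $1; }  # closes an outermost paragraph: keep
        $_[0] = 0 if $_[0] < 0;         # stray closing tag: reset the depth
    }
    else {                              # opening tag
        $_[0]++;
        return $1 if $_[0] == 1;        # opens an outermost paragraph: keep
    }
    return '';                          # nested or stray tag: drop it
}
```

- Ant - Some of my best work - (1 2 3)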
Re^2: Excluding groups of characters in regular expressions by semirhage (Initiate) on Dec 21, 2007 at 16:19 UTC
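A friend of mine was able to help me... here is the answer for anybody who needs help with this sort of thing in the future.

```perl
$_='$0<paragraph>$1<paragraph>$2</paragraph>$3<paragraph>$4<paragraph>$5</paragraph>$6</paragraph>$7</paragraph>$8<paragraph>$9</paragraph>$10<paragraph>$11</paragraph>$12 ';

# Turn the text itself into a regex: quote every token, open a capture
# group after each <paragraph> and close it before each </paragraph>.
($re=$_)=~s/((<paragraph>)|(<\/paragraph>)|[^<]+|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;

# Matching the text against that regex yields the content of every
# paragraph; join the contents into one alternation.
$re=join"|",map{quotemeta}(eval{/$re/});

# Wherever a paragraph's content occurs, strip the tags inside it.
s{($re)}{local $_=$1;s#</?paragraph>##g;$_}eg;
print;
```

Tom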
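For readers trying to follow what that snippet does, here is a longer-hand sketch of what appears to be the same idea. It is a reading of the code above, not the friend's own version: the function names (`build_capture_regex`, `strip_tags`, `strip_nested`) are invented for illustration, and it assumes the paragraph tags are balanced, since otherwise the generated regex has unmatched parentheses and will not compile.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the text into a regex that matches the text
# itself, with one capture group around the content of every paragraph.
sub build_capture_regex {
    my ($text) = @_;
    my $re = '';
    while ($text =~ m{(<paragraph>)|(</paragraph>)|([^<]+|.)}gs) {
        if    (defined $1) { $re .= quotemeta($1) . '(' }   # open a group after the tag
        elsif (defined $2) { $re .= ')' . quotemeta($2) }   # close it before the tag
        else               { $re .= quotemeta($3) }         # plain text, quoted
    }
    return $re;
}

# Hypothetical helper: remove every paragraph tag from a string.
sub strip_tags {
    my ($s) = @_;
    $s =~ s{</?paragraph>}{}g;
    return $s;
}

sub strip_nested {
    my ($text) = @_;
    return $text unless $text =~ /<paragraph>/;

    # Groups are numbered by opening parenthesis, so the content of an
    # outer paragraph comes before the content of anything nested in it.
    my $re       = build_capture_regex($text);
    my @contents = grep { length } ($text =~ /$re/s);
    return $text unless @contents;

    # Wherever a paragraph's content occurs, strip the tags inside it.
    my $alternation = join '|', map { quotemeta } @contents;
    $text =~ s{($alternation)}{strip_tags($1)}ge;
    return $text;
}

print strip_nested('a<paragraph>b<paragraph>c</paragraph>d</paragraph>e'), "\n";
# Prints: a<paragraph>bcd</paragraph>e
```

Because the regex is built from the whole input and then matched against it, the pattern is as long as the text, which is worth keeping in mind for a huge file.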
Re^3: Excluding groups of characters in regular expressions by suaveant (Parson) on Dec 21, 2007 at 19:52 UTC
Have you ever heard anyone discussing your friend's code before... and if so, was there a lot of swearing involved? No offense, but the code looks like an entry in an obfuscation contest. Not so bad if it's a one-off script, but barely maintainable if it's going to be around for a bit.

- Ant - Some of my best work - (1 2 3)
Re: Excluding groups of characters in regular expressions by Jaap (Curate) on Dec 21, 2007 at 14:50 UTC
That kind of thing is pretty hard with regexes, so people usually recommend using a real parser like HTML::Parser.