semirhage has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to remove any nested paragraph tags from a huge file.

I have come up with the following regex so far...

s/<paragraph>(.*?)<paragraph>(.*?)<\/paragraph>(.*?)<\/paragraph>/<paragraph>$1$2$3<\/paragraph>/ig;

This appears to work in many cases; however, it also removes every other pair of properly formatted paragraph tags...

e.g.

If the input was the following:

<paragraph>some data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>

The regex would result in this being changed to:

<paragraph>some data</paragraph>more data<paragraph>even more data</paragraph>

After thinking about it, it makes sense, since I am trying to match four tags within the text and the (.*?) doesn't exclude other paragraph tags from being included...

Is there any way to exclude <paragraph> or </paragraph> from the (.*?) matches?

Thanks...

Tom
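
On the literal question, a negative lookahead can keep `.` from crossing a paragraph tag. The following is a minimal sketch, not a tested production fix: `flatten_nested_paragraphs` and `$body` are names made up for illustration, and it assumes the tags are balanced and carry no attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: remove <paragraph> pairs that sit inside another
# <paragraph>, keeping their text.
sub flatten_nested_paragraphs {
    my ($text) = @_;

    # Any run of characters that contains no opening or closing
    # paragraph tag: the lookahead is checked before every character.
    my $body = qr{(?:(?!</?paragraph>).)*}is;

    # An open tag, tag-free text, then a complete tag-free inner pair.
    # Each pass strips the innermost pairs it can see; repeat until
    # a pass makes no substitution.
    1 while $text =~ s{(<paragraph>$body)<paragraph>($body)</paragraph>}{$1$2}gi;

    return $text;
}

my $nested = '<paragraph>a<paragraph>b</paragraph>c</paragraph>';
print flatten_nested_paragraphs($nested), "\n";   # <paragraph>abc</paragraph>
```

Sibling paragraphs such as the three in the example input should come back unchanged, because `$body` cannot cross the `</paragraph>` that separates them. Each pass rescans the whole string, so for a very large file the single-pass depth counting suggested in the replies is probably the better fit.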

Re: Excluding groups of characters in regular expressions
by suaveant (Parson) on Dec 21, 2007 at 14:53 UTC
    You'd be better off treating this more as a parser problem than a regexp problem, because trying to match nested items in regexp is advanced stuff... here is some code that should get you started... not ideal but easy :)
    $text = "<paragraph>some <paragraph>some data</paragraph>data</paragraph><paragraph>more data</paragraph><paragraph>even more data</paragraph>";
    my $depth = 0;
    $text =~ s{(<(/)?paragraph>)}{check($depth)}gie;
    print "$text\n";

    sub check {
        if($2) {
            $_[0]--;
            if($_[0] == 0) {
                return $1;
            }
            $_[0] = 0 if $_[0] < 0;
        } else {
            $_[0]++;
            return $1 if $_[0] == 1;
        }
        return '';
    }

                    - Ant
                    - Some of my best work - (1 2 3)
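
Since the original question mentions a huge file, the same depth counter can be carried from line to line instead of loading everything into one string. This is only a sketch under assumptions: `strip_nested_stream` and `$keep` are hypothetical names, and no tag is expected to be split across a line break.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out line by line, keeping only the
# outermost <paragraph> tags. $depth persists across lines, so a
# paragraph may open on one line and close on a later one.
sub strip_nested_stream {
    my ($in, $out) = @_;
    my $depth = 0;

    # Decide whether one tag survives; updates $depth as a side effect.
    my $keep = sub {
        my ($is_close) = @_;
        if ($is_close) {
            return 0 if $depth == 0;   # stray closing tag: drop it
            return --$depth == 0;      # keep only the outermost close
        }
        return ++$depth == 1;          # keep only the outermost open
    };

    while (my $line = <$in>) {
        # $2 is "/" for a closing tag and empty for an opening tag.
        $line =~ s{(<(/?)paragraph>)}{ $keep->($2) ? $1 : '' }gie;
        print {$out} $line;
    }
    return;
}
```

To try it without touching the disk, both handles can be opened on in-memory scalars (`open my $in, '<', \$input`); for real use they would be handles on the input and output files.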

      A friend of mine was able to help me... here is the answer for anybody who needs help with this sort of thing in the future.

      $_='$0<paragraph>$1<paragraph>$2</paragraph>$3<paragraph>$4<paragraph>$5</paragraph>$6</paragraph>$7</paragraph>$8<paragraph>$9</paragraph>$10<paragraph>$11</paragraph>$12 ';
      ($re=$_)=~s/((<paragraph>)|(<\/paragraph>)|[^<]+|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
      $re=join"|",map{quotemeta}(eval{/$re/});
      s{($re)}{local $_=$1;s#</?paragraph>##g;$_}eg;
      print;


      Tom
        Have you ever heard anyone discussing your friend's code before... and if so, was there a lot of swearing involved?

        No offense, but the code looks like an entry to an obfuscation contest. Not so bad if it's a one-off script, but barely maintainable if it's going to be around for a bit.

                        - Ant

Re: Excluding groups of characters in regular expressions
by Jaap (Curate) on Dec 21, 2007 at 14:50 UTC
    That kind of thing is pretty hard with regexes, so people usually recommend using a real parser like HTML::Parser.