murugu has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

While doing text processing, I want to match the content inside a <P> tag, but I don't want to match <P> tags that contain a <b> tag.

<P>Just another perl hacker</P>
<P>Just <b>another</b> perl hacker</P>

For the above input I want a regular expression that matches the first line and not the second.

Many thanks in advance

--Murugesan--

Replies are listed 'Best First'.
Re: Regular expression matching
by matija (Priest) on Mar 17, 2004 at 10:03 UTC
    I don't think you want to do that with one regular expression. If I had to do it with regular expressions, I would first match text inside paragraphs, and then discard all the paragraphs that had <b> in them.

    However, parsing HTML with regular expressions is an exercise in frustration. What happens if you have a newline in the tag? What happens if you have one in the paragraph? By the time you've resolved all those problems, you've written the better part of an HTML parser.

    You'd be much better off using HTML::Parser, or HTML::TokeParser::Simple.
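A minimal sketch of the regex-only, two-step route described above (match the paragraphs first, then discard those containing <b>). It uses core Perl only; the helper name paragraphs_without_bold and the sample input are made up for illustration, and nested or unclosed <p> tags are not handled, which is exactly where a real parser earns its keep.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bodies of all <p> elements
# that do not contain an opening <b> tag.
sub paragraphs_without_bold {
    my ($html) = @_;

    # Step 1: collect every paragraph body. /s lets '.' cross newlines,
    # /i ignores case, and [^>]* tolerates attributes or a newline in the tag.
    my @bodies = $html =~ m{<p\b[^>]*>(.*?)</p\s*>}gis;

    # Step 2: discard bodies that contain an opening <b> tag.
    # The \b keeps <br> and <body> from being mistaken for <b>.
    return grep { !m{<b\b}i } @bodies;
}

# Assumed sample input: the two lines from the question plus a
# paragraph whose tag and body both contain a newline.
my $html = "<P>Just another perl hacker</P>\n"
         . "<P>Just <b>another</b> perl hacker</P>\n"
         . "<p\nclass=\"x\">spans\ntwo lines</p>\n";

print "[$_]\n" for paragraphs_without_bold($html);
```

Splitting the work into two simple patterns keeps each one readable, at the cost of a second pass over every paragraph body.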

Re: Regular expression matching
by Corion (Patriarch) on Mar 17, 2004 at 10:15 UTC

    Regular expressions are good for many things, but parsing HTML with them is not easy if your problem space goes beyond the trivial.

    To solve the problem as you described it with regular expressions, use the following:

    use strict;

    while (<DATA>) {
        chomp;
        # Declare first, then assign conditionally ("my ... if" is unreliable)
        my $match;
        $match = "'$1'" if m#^<p>((?:[^<]*|<(?!b>)[^>]+>)*?)</p>$#ism;
        $match = "<nothing>" unless defined $match;
        print "$_ matches $match\n";
    }

    __DATA__
    <p>this should match</p>
    <p>This should <b>not</b> match</p>
    <p>What about <a href="http://www.example.com">this</a>?</p>
    <p>And <p>this</p> malformed piece?</p>

    But in general, you will be better off looking at the various HTML parsers, for example HTML::TokeParser::Simple, or at the modules that strip HTML, like HTML::TagStripper.

    If you are interested in extracting specific text out of webpages, there is also a variety of modules to use. Personally, I like XML::LibXML very much because XPath is a very convenient way to extract text. The following XPath expression finds all p elements that do not contain any b element as a descendant, and returns their text nodes:

    //p[not(descendant::b)]/text()
Re: Regular expression matching
by Hena (Friar) on Mar 17, 2004 at 10:23 UTC
    Well, I think this is more readable (if everything is on the same line):
    if (m#<P>(.+)</P># && $1!~m/<b>/) {}
    But a single regex could be this (I'm not sure which would be faster):
    m#<P>(?!.+<b>)(.+)</P>#i
    But as people noted above, HTML can be tricky, since tags can span multiple lines or share a line. So be careful when doing pure regex handling on HTML.
      m#<P>(?!.+<b>)(.+)</P>#i
      Nope. That would fail to match on a string like "<P>This is fine.</P><P><b>next</b></P>".

      The following ought to work. It performs a negative lookahead for the unwanted string "<b>" before every character it matches inside the paragraph.

      m#<P>(?:(?!<b>).)+?</P>#i
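To make the difference concrete, here is a small sketch that runs both patterns against the string quoted above. A capture group is added to the second pattern so the result can be printed; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = '<P>This is fine.</P><P><b>next</b></P>';

# Single lookahead right after <P>: it scans the rest of the string,
# so the <b> in the second paragraph vetoes the first paragraph.
my ($naive) = $text =~ m#<P>(?!.+<b>)(.+)</P>#i;

# Lookahead repeated before every character of the body: the body
# can only grow while no <b> starts at the current position.
my ($tempered) = $text =~ m#<P>((?:(?!<b>).)+?)</P>#i;

print "naive:    ", (defined $naive    ? $naive    : '(no match)'), "\n";
print "tempered: ", (defined $tempered ? $tempered : '(no match)'), "\n";
```

The single-lookahead pattern skips the first paragraph and ends up capturing the bold one, while the per-character lookahead captures "This is fine.".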
        True.

        That's why I said that regexes on plain HTML are not easy. I just had to go eat, so I didn't have time to refine it to do what you wrote :). I might have gone the route of the first approach (more understandable and perhaps faster), e.g.
        if (m#<P>(.+?)</P># && $1!~m/<b>/i)
        Oh, well ;).
        I submit:
        qr %<p>([^<]*(?:<(?!b>|/p>)[^<]*)*)</p>%i;
        and a benchmark. The solution with the lookahead on each character does much better than I expected:
        #!/usr/bin/perl

        use strict;
        use warnings;

        use Benchmark qw /cmpthese/;

        our $abigail = qr %<p>([^<]*(?:<(?!b>|/p>)[^<]*)*)</p>%i;
        our $bart    = qr %<P>((?:(?!<b>).)+?)</P>%is;
        our $corion  = qr %<p>((?:[^<]*|<(?!b>)[^>]+>)*?)</p>%i;

        my  @names   = qw /abigail bart corion/;
        our @data    = <DATA>;
        my  @correct = ('Just another perl hacker',
                        'this should match',
                        'What about <a href="http://www.example.com">this</a>?',
                        'And <p>this');

        cmpthese -1 => {map {$_ => "\@$_ = map {/\$$_/g} \@data"} @names};

        no strict 'refs';
        "@$_" eq "@correct" or die ucfirst for @names;

        __DATA__
        <P>Just another perl hacker</P>
        <P>Just <b>another</b> perl hacker</P>
        <p>this should match</p>
        <p>This should <b>not</b> match</p>
        <p>What about <a href="http://www.example.com">this</a>?</p>
        <p>And <p>this</p> malformed piece?</p>

                   Rate  corion    bart abigail
        corion   6457/s      --    -68%    -72%
        bart    19910/s    208%      --    -14%
        abigail 23209/s    259%     17%      --

        Abigail
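For readers who find the one-line pattern hard to parse, here is the same regex spelled out with the /x modifier. The comments are my reading of it rather than part of the original post, and the name $unrolled is only illustrative.

```perl
use strict;
use warnings;

# Same pattern as the one-line version above, with whitespace and comments.
my $unrolled = qr{
    <p>                    # opening tag; attributes are not handled
    (                      # capture the paragraph body
        [^<]*              # a run of text without any '<'
        (?:
            <              # a '<' is allowed ...
            (?! b> | /p> ) # ... unless it starts <b> or </p>
            [^<]*          # followed by another run of plain text
        )*
    )
    </p>                   # closing tag
}xi;

my @lines = (
    '<P>Just another perl hacker</P>',
    '<P>Just <b>another</b> perl hacker</P>',
    '<p>What about <a href="http://www.example.com">this</a>?</p>',
);

print "$_\n" for map { /$unrolled/g } @lines;
```

Because each '<' is inspected once and the text runs are consumed with [^<]*, the engine does little backtracking, which may be part of why this form does well in the benchmark.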