vitoco has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I'm trying to parse an HTML file that has many nested tags like <div> or <span>.

When I try this simple test:

#!perl
$a = 'A<div>B<div>C</div>D</div>E';
$a =~ m%<div>(.*?)</div>%;
print "=$1=\n";

the result is =B<div>C= instead of just =C=.

If I replace (.*?) with (.*), the result is =B<div>C</div>D= as expected.

What is going on with the not greedy regexpr?

Is there a way to extract only the inner string?

Please note that capital letters from my example represent any HTML string, with other tags included.

Re: regular expression and nested tags
by afoken (Chancellor) on Jun 12, 2009 at 20:28 UTC

    Don't attempt to parse HTML (or XML) with RegExps, that won't work. Use a proper parser, like HTML::Parser.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      I'm just doing some global substitutions to clean up a slurped doc.

Re: regular expression and nested tags
by bichonfrise74 (Vicar) on Jun 12, 2009 at 20:35 UTC
Re: regular expression and nested tags
by ikegami (Patriarch) on Jun 12, 2009 at 20:40 UTC

    What is going on with the not greedy regexpr?

    Non-greedy stops at earliest position where the pattern will match (before the first </div>).
    Greedy stops at latest position where the pattern will match (before the last </div>).

Re: regular expression and nested tags
by JavaFan (Canon) on Jun 12, 2009 at 22:26 UTC
    The subpattern is not greedy. However, having a non-greedy subpattern does not mean that of all possible matches anywhere in the string, Perl is going to find the smallest overall match.

    It just means that when it's time to match the non-greedy subpattern, of all the possible submatches that start at the current point and lead to an overall match, it'll pick the shortest one.

    So the <div> in the pattern matches the first <div> in the string; from there, Perl finds the smallest submatch that makes the entire pattern match.

    Hence the result you're getting.

    If you want to match <div> and </div> with no <div> in between, write your pattern accordingly. Or just use one of the many HTML parsing modules out there.
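The "leftmost start first, then shortest" behaviour described above can be checked directly. The following is only a small sketch using the test string from the question; the variable names are made up for illustration. It records where each match begins via `$-[0]`, and then uses a greedy `.*` prefix (not something suggested in the thread, just a common way to make the point) to push the start to the last `<div>` in the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'A<div>B<div>C</div>D</div>E';

# Non-greedy: the match still begins at the leftmost <div>;
# only the end of the capture is as early as possible.
my ( $lazy, $lazy_start );
if ( $text =~ m%<div>(.*?)</div>%s ) {
    ( $lazy, $lazy_start ) = ( $1, $-[0] );
}

# Greedy: same start position, but the capture ends as late as possible.
my ( $greedy, $greedy_start );
if ( $text =~ m%<div>(.*)</div>%s ) {
    ( $greedy, $greedy_start ) = ( $1, $-[0] );
}

# Illustration only: a greedy prefix makes the engine try the
# rightmost <div> first, so the capture is the innermost text here.
my ($last) = $text =~ m%.*<div>(.*?)</div>%s;

print "lazy   =$lazy= starts at $lazy_start\n";      # =B<div>C= starts at 1
print "greedy =$greedy= starts at $greedy_start\n";  # =B<div>C</div>D= starts at 1
print "last   =$last=\n";                            # =C=
```

The greedy prefix only finds the last `<div>` in the whole string, so it is not a general answer for documents with several separate nested groups.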

Re: regular expression and nested tags
by graff (Chancellor) on Jun 13, 2009 at 16:22 UTC
    Given these two things you've said so far:
    Please note that capital letters from my example represent any HTML string, with other tags included.

    I'm just doing some global substitutions to clean up a slurped doc.

    I have a hunch the task could be complicated enough that using regexes really isn't the way to go (unless you know something for certain about the parts you need to "clean up" that you haven't mentioned here so far).

    In any case, it's definitely worthwhile to work your way out of this misconception, that using a real parser is too hard for a "simple" problem like yours, which, if I understand correctly, involves locating the content of the inner-most "div" within a set of nested divs.

    Your own sample string doesn't do justice to your statement of the problem, so here's a basic parser demo that includes a test string "with other tags included" -- it took me about 15 minutes (update: well, less than 30, for sure -- my how time flies), including looking stuff up in the HTML::Parser man page:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my $html = <<EOT;
    <html><div>foo_a<div>foo_b <div>foo_c0 <a href=\"bar\">baz</a> foo_c1</div> foo_d</div>foo_e</div></html>
    EOT

    my ( $divtext, $indiv );

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&div_check, "tagname,text" ],
        text_h      => [ sub { $divtext .= $_[0] if $indiv }, "dtext" ],
        end_h       => [ \&work_on_divtext, "tagname,text" ]
    );
    $p->parse( $html );

    sub div_check {
        my ( $tag, $text ) = @_;
        if ( $tag eq 'div' ) {
            $divtext = '';
            $indiv = 1;
        }
        elsif ( $indiv ) {
            $divtext .= $text;
        }
    }

    sub work_on_divtext {
        my ( $tag, $text ) = @_;
        if ( $tag eq 'div' ) {
            print "=$divtext=\n" if ( $divtext );
            $divtext = '';
            $indiv = 0;
        }
        elsif ( $indiv ) {
            $divtext .= $text;
        }
    }

    Depending on what you are really trying to accomplish, that script could probably be made a fair bit simpler than it is, but personally, I don't think it's all that complicated, and it's a lot more reliable, flexible, readable, maintainable, etc, etc, than any regex solution I could come up with (assuming I could come up with one, which I'm not sure I'd want to try).

      Well, I've spent much more than an hour (actually over 4 hours) reading docs on advanced patterns and trying. :-P

      Finally, each of the following patterns did what I wanted:

      $a =~ m%<div\b[^>]*>((?:(?!</div>)(?!<div\b).)*)</div>%;
      $a =~ m%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%;

      My first attempt was to use the simple "<div\b[^>]*>([^<>]*)</div>" pattern, but it failed when the string contained any other tag.

      The following example shows better what I needed to do (in this case, class Y container being removed):

      #!perl
      $a = 'A<div class="X">B<div class="Y">C<span class="Z">D</span>E</div>F</div>G';
      print "before: $a\n";
      $a =~ s%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%$1%g;
      print "after: $a\n";

      Anyway, there is too much to learn on this topic...

      Thanks to everyone!
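For readers who want to see what that substitution does pass by pass, here is a small sketch built on the same pattern. The helper name `unwrap_innermost_divs` is hypothetical, and the `/s` modifier is an addition (an assumption that a slurped document contains newlines); everything else follows the example above. Each pass unwraps only those divs that contain no other `<div>`, so nested containers disappear one level at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove the tags of every <div> that contains
# no other <div>, keeping the content between them.
sub unwrap_innermost_divs {
    my ($html) = @_;

    # (?!<div\b) is checked before each character of the body, so the
    # body can never run into another opening <div>.
    # /s is an assumption: it lets "." cross newlines in slurped input.
    ( my $out = $html ) =~ s%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%$1%gs;
    return $out;
}

my $doc   = 'A<div class="X">B<div class="Y">C<span class="Z">D</span>E</div>F</div>G';
my $once  = unwrap_innermost_divs($doc);
my $twice = unwrap_innermost_divs($once);

print "before: $doc\n";
print "pass 1: $once\n";     # A<div class="X">BC<span class="Z">D</span>EF</div>G
print "pass 2: $twice\n";    # ABC<span class="Z">D</span>EFG
```

This only illustrates the regex from the post; it makes no attempt to handle comments, scripts, or unbalanced markup, which is where a real parser earns its keep.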

Re: regular expression and nested tags
by ikegami (Patriarch) on Jun 12, 2009 at 20:42 UTC

    What is going on with the not greedy regexpr?

    Non-greedy stops at earliest position where the pattern will match (before the first </div>).
    Greedy stops at latest position where the pattern will match (before the last </div>).

      Just thought that starting at <div> after B would result in a better nongreedy result.

      Is there a way to match a string contained between two delimiters which does not contain a given word? (not a class)

        No, it only affects /.*/. It doesn't affect any other part of the match, namely, the earlier /<div>/.

        Is there a way to match a string contained between two delimiters which does not contain a given word? (not a class)

        The following would do here:
        m{<div>(?:(?!</div>).)*</div>}
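The pattern above generalizes to "anything between two delimiters that does not contain a given word": put every forbidden string into the negative lookahead. The sketch below assumes a hypothetical helper, `between_without`, that builds such a pattern from literal strings; it is an illustration of the idea, not code from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a pattern matching OPEN ... CLOSE whose
# body contains neither CLOSE nor any of the extra forbidden strings.
sub between_without {
    my ( $open, $close, @forbidden ) = @_;

    # Every string is treated literally, hence quotemeta.
    my $stop = join '|', map { quotemeta($_) } ( $close, @forbidden );

    # The lookahead runs before each body character; /s lets "." match newlines.
    return qr/\Q$open\E((?:(?!$stop).)*)\Q$close\E/s;
}

my $text = 'A<div>B<div>C</div>D</div>E';

# Only the closing delimiter is forbidden (the pattern given above).
my $plain = between_without( '<div>', '</div>' );

# Also forbid another opening <div>, so only an innermost pair can match.
my $inner = between_without( '<div>', '</div>', '<div>' );

my ($first)     = $text =~ $plain;
my ($innermost) = $text =~ $inner;

print "=$first=\n";        # =B<div>C=
print "=$innermost=\n";    # =C=
```

With only the closing delimiter excluded, the match still starts at the first `<div>`, which is why the first result is the same as with the original non-greedy pattern.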