Re: regular expression and nested tags

Given these two things you've said so far:

Please note that capital letters from my example represent any HTML string, with other tags included.
I'm just doing some global substitutions to clean up a slurped doc.

I have a hunch the task could be complicated enough that using regexes really isn't the way to go (unless you know something for certain about the parts you need to "clean up" that you haven't mentioned here so far).

In any case, it's definitely worthwhile to work your way out of this misconception, that using a real parser is too hard for a "simple" problem like yours, which, if I understand correctly, involves locating the content of the inner-most "div" within a set of nested divs.

Your own sample string doesn't do justice to your statement of the problem, so here's a basic parser demo that includes a test string "with other tags included" -- it took me about 15 minutes (update: well, less than 30, for sure -- my how time flies), including looking stuff up in the HTML::Parser man page:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

my $html = <<EOT;
<html><div>foo_a<div>foo_b
<div>foo_c0 <a href=\"bar\">baz</a> foo_c1</div>
foo_d</div>foo_e</div></html>
EOT

my ( $divtext, $indiv );
my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&div_check, "tagname,text"],
                           text_h => [sub { $divtext .= $_[0] if $indiv }, "dtext"],
                           end_h => [\&work_on_divtext, "tagname,text"] );

$p->parse( $html );
$p->eof;    # signal end of document so any buffered text is flushed

sub div_check
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        $divtext = '';
        $indiv = 1;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}

sub work_on_divtext
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        print "=$divtext=\n" if ( $divtext );
        $divtext = '';
        $indiv = 0;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}

Depending on what you are really trying to accomplish, that script could probably be made a fair bit simpler than it is, but personally, I don't think it's all that complicated, and it's a lot more reliable, flexible, readable, maintainable, etc, etc, than any regex solution I could come up with (assuming I could come up with one, which I'm not sure I'd want to try).


Replies are listed 'Best First'.

Re^2: regular expression and nested tags
by vitoco (Hermit) on Jun 15, 2009 at 15:23 UTC

Well, I've spent much more than an hour (actually over 4 hours) reading docs on advanced patterns and trying. :-P

Finally, each of the following patterns did what I wanted:

$a =~ m%<div\b[^>]*>((?:(?!</div>)(?!<div\b).)*)</div>%;
$a =~ m%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%;

My first attempt was to use the simple "<div\b[^>]*>([^<>]*)</div>" pattern, but it failed when the string contained any other tag.
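
To make the difference concrete, here is a minimal sketch that runs the simple pattern and the second lookahead pattern against the same string. The sample string, the variable names and the innermost_div helper are assumptions made for illustration; the lookahead pattern is the one quoted above, only spread out with /x so each part can carry a comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample string (an assumption for illustration): nested divs, with a span
# inside the innermost one.
my $str = 'A<div class="X">B<div class="Y">C<span class="Z">D</span>E</div>F</div>G';

# First attempt: the captured content may not contain any '<' or '>',
# so it cannot cross the span tags.
my $simple = qr{<div\b[^>]*>([^<>]*)</div>};

# Lookahead version, spelled out with /x.
my $tempered = qr{
    <div\b[^>]*>        # an opening div tag, with any attributes
    (                   # capture the content:
      (?: (?!<div\b) .  #   one character that does not start another div tag
      )*?               #   repeated as few times as possible
    )
    </div>              # up to the nearest closing div tag
}x;

# Hypothetical helper: return the first captured div content, or undef.
sub innermost_div {
    my ( $re, $text ) = @_;
    return $text =~ $re ? $1 : undef;
}

my $first  = innermost_div( $simple,   $str );
my $second = innermost_div( $tempered, $str );

print "simple:   ", ( defined $first  ? $first  : '(no match)' ), "\n";
print "tempered: ", ( defined $second ? $second : '(no match)' ), "\n";
```

The simple pattern finds nothing here, because every div in the sample contains another tag before its closing tag. The lookahead version cannot start at the outer div either (the inner opening tag blocks it), so the first place it can match is the innermost div.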

The following example shows better what I needed to do (in this case, class Y container being removed):

#!perl
$a = 'A<div class="X">B<div class="Y">C<span class="Z">D</span>E</div>F</div>G';
print "before: $a\n";
# $1 rather than \1 in the replacement; \1 draws a warning under -w
$a =~ s%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%$1%g;
print "after: $a\n";
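
One caveat for a slurped document: "." does not match a newline unless the /s modifier is given, so the substitution above silently skips any div whose content spans more than one line. A minimal sketch of the same substitution with /s added follows; the unwrap_innermost_divs name and the multi-line sample are assumptions for illustration, not part of the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: unwrap every innermost <div>...</div> pair in $text,
# keeping the content. The /s modifier lets '.' match newlines as well.
sub unwrap_innermost_divs {
    my ($text) = @_;
    $text =~ s{<div\b[^>]*>((?:(?!<div\b).)*?)</div>}{$1}gs;
    return $text;
}

# Same structure as the one-line example, but spread over several lines.
my $doc = <<'EOT';
A<div class="X">B<div class="Y">C
<span class="Z">D</span>
E</div>F</div>G
EOT

print "before: $doc";
print "after:  ", unwrap_innermost_divs($doc);
```

A single pass only unwraps the innermost level, which is the behaviour wanted in the class Y example; the class X container is left in place.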

Anyway, there is too much to learn on this topic...

Thanks to everyone!
