Please note that capital letters from my example represent any HTML string, with other tags included. I'm just doing some global substitutions to clean up a slurped doc.
I have a hunch the task could be complicated enough that using regexes really isn't the way to go (unless you know something for certain about the parts you need to "clean up" that you haven't mentioned here so far).
In any case, it's definitely worthwhile to get past the misconception that using a real parser is too hard for a "simple" problem like yours, which, if I understand correctly, involves locating the content of the innermost "div" within a set of nested divs.
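To make the concern about regexes concrete, here is a minimal sketch of what the two most obvious patterns do with nested divs. The sample string and the variable names ($html, $lazy, $greedy) are made up for illustration, and this is only the naive approach, not a claim that no regex could ever work:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: three nested divs, innermost content is "foo_c"
my $html = '<div>foo_a<div>foo_b<div>foo_c</div>foo_d</div>foo_e</div>';

# Non-greedy: the match starts at the FIRST <div> and stops at the first
# </div>, so the capture drags the outer opening tags along with it.
my ($lazy) = $html =~ m{<div>(.*?)</div>}s;
print "lazy:   $lazy\n";      # lazy:   foo_a<div>foo_b<div>foo_c

# Greedy: the match runs from the first <div> to the LAST </div>.
my ($greedy) = $html =~ m{<div>(.*)</div>}s;
print "greedy: $greedy\n";    # greedy: foo_a<div>foo_b<div>foo_c</div>foo_d</div>foo_e
```

Neither capture is the innermost content, and that is before attributes, other tags, or stray whitespace inside the tags enter the picture.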
Your own sample string doesn't do justice to your statement of the problem, so here's a basic parser demo that includes a test string "with other tags included" -- it took me about 15 minutes (update: well, less than 30, for sure -- my, how time flies), including looking stuff up in the HTML::Parser man page:
```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Test string: nested divs, with another tag inside the innermost one
my $html = <<EOT;
<html><div>foo_a<div>foo_b
<div>foo_c0 <a href="bar">baz</a> foo_c1</div>
foo_d</div>foo_e</div></html>
EOT

# $divtext collects content since the most recent <div>;
# $indiv is true from a <div> until the next </div>
my ( $divtext, $indiv );

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&div_check, "tagname,text"],
                           text_h  => [sub { $divtext .= $_[0] if $indiv }, "dtext"],
                           end_h   => [\&work_on_divtext, "tagname,text"]
                         );
$p->parse( $html );
$p->eof;    # signal end of input so any buffered text is flushed

# Start tags: a new <div> resets the buffer, so only the content of the
# most recently opened div survives; other tags are kept verbatim
sub div_check
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        $divtext = '';
        $indiv = 1;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}

# End tags: the first </div> after a <div> closes an innermost div,
# so print what was collected and stop collecting
sub work_on_divtext
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        print "=$divtext=\n" if ( $divtext );
        $divtext = '';
        $indiv = 0;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}
```

Depending on what you are really trying to accomplish, that script could probably be made a fair bit simpler than it is. Personally, I don't think it's all that complicated, and it's a lot more reliable, flexible, readable and maintainable than any regex solution I could come up with (assuming I could come up with one, which I'm not sure I'd want to try).
In reply to Re: regular expression and nested tags by graff, in thread regular expression and nested tags by vitoco