Given these two things you've said so far:

Please note that capital letters from my example represent any HTML string, with other tags included.
I'm just doing some global substitutions to clean up a slurped doc.

I have a hunch the task could be complicated enough that using regexes really isn't the way to go (unless you know something for certain about the parts you need to "clean up" that you haven't mentioned here so far).

In any case, it's definitely worthwhile to work your way out of the misconception that using a real parser is too hard for a "simple" problem like yours, which, if I understand correctly, involves locating the content of the innermost "div" within a set of nested divs.

Your own sample string doesn't do justice to your statement of the problem, so here's a basic parser demo that includes a test string "with other tags included" -- it took me about 15 minutes (update: well, less than 30, for sure -- my how time flies), including looking stuff up in the HTML::Parser man page:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

my $html = <<EOT;
<html><div>foo_a<div>foo_b
<div>foo_c0 <a href=\"bar\">baz</a> foo_c1</div>
foo_d</div>foo_e</div></html>
EOT

my ( $divtext, $indiv );
my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&div_check, "tagname,text"],
                           text_h => [sub { $divtext .= $_[0] if $indiv }, "dtext"],
                           end_h => [\&work_on_divtext, "tagname,text"] );

$p->parse( $html );
$p->eof;    # signal end of input so any buffered text is flushed

sub div_check
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        $divtext = '';
        $indiv = 1;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}

sub work_on_divtext
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        print "=$divtext=\n" if ( $divtext );
        $divtext = '';
        $indiv = 0;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}

Depending on what you are really trying to accomplish, that script could probably be made a fair bit simpler than it is, but personally, I don't think it's all that complicated, and it's a lot more reliable, flexible, readable, maintainable, etc, etc, than any regex solution I could come up with (assuming I could come up with one, which I'm not sure I'd want to try).
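
For what it's worth, here is one way the nesting could be tracked more explicitly. This is only a sketch, not a drop-in replacement: the function name innermost_divs and the per-div records on @stack are names I made up for illustration, and I've swapped "dtext" for "text" in the text handler so that entities come back undecoded along with the raw tags. Each open div gets its own buffer, and a div is reported only if no other div was opened inside it:

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the raw content of every div that contains no other div.
# (innermost_divs is a hypothetical helper name, not part of HTML::Parser.)
sub innermost_divs {
    my ($html) = @_;
    my @stack;    # one { text, has_child } record per currently open div
    my @found;    # contents of the innermost divs, in document order

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ( $tag, $text ) = @_;
            if ( $tag eq 'div' ) {
                # the enclosing div (if any) is no longer innermost
                $stack[-1]{has_child} = 1 if @stack;
                push @stack, { text => '', has_child => 0 };
            }
            elsif (@stack) {
                $stack[-1]{text} .= $text;    # keep other tags verbatim
            }
        }, "tagname,text" ],
        text_h => [ sub { $stack[-1]{text} .= $_[0] if @stack }, "text" ],
        end_h  => [ sub {
            my ( $tag, $text ) = @_;
            if ( $tag eq 'div' ) {
                my $div = pop @stack or return;    # ignore a stray </div>
                push @found, $div->{text} unless $div->{has_child};
            }
            elsif (@stack) {
                $stack[-1]{text} .= $text;
            }
        }, "tagname,text" ],
    );

    $p->parse($html);
    $p->eof;
    return @found;
}

# Usage: two sibling divs nested inside an outer one.
my $sample = '<div>a<div>b <a href="x">y</a> c</div>d<div>e</div>f</div>';
print "=$_=\n" for innermost_divs($sample);
```

With that sample string, the two inner divs should be reported and the outer one skipped. Keeping a buffer per open div costs a few more lines, but it means the outer divs' text is still available on the stack if it turns out you need that too.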

In reply to Re: regular expression and nested tags by graff
in thread regular expression and nested tags by vitoco
