Parsing nested HTML with just regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Given the following HTML:

<div id="iwant">
  <div id="insideiwant">
  </div>
</div>

<div id="nowant">
</div>

I want to get the contents of the first outer div block (with id "iwant").

However, there may be more div blocks after the iwant block, so I can't use a greedy regex: it would pick up all the div blocks, not just the section I want.

The id attributes are there just for the sake of identification. In a real-world example, there would be no way to differentiate div blocks besides the hierarchy of the HTML.

I figure maybe something with lookaheads might do it but I'm just barely into learning how to use lookaheads and the like.

Thanks.

-MN

Replies are listed 'Best First'.
(jeffa) Re: Parsing nested HTML with just regex by jeffa (Bishop) on Jul 23, 2003 at 15:42 UTC
And just to give you some incentive to use a parser, here is some code that uses HTML::TokeParser::Simple. I don't know what you want to do with these div's, but this code will create a two-dimensional array where each 'row' is the 'level' found (outermost div's are level 1, actually index 0 in the array), and the columns are the div id's. This code does not associate the nested div's; in other words, after parsing you will not know that "insideiwant" was actually inside "iwant". That is not particularly hard to do, it's just that this code doesn't do that. ;)

use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

my $parser = HTML::TokeParser::Simple->new('foo.html');
my $level = 0;
my @div;
while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('div')) {
        push @{$div[$level]}, $token->return_attr->{id};
        $level++;
    }
    elsif ($token->is_end_tag('div')) {
        $level--;
    }
}

# print all div's found
for my $row (@div) {
    print "level ", ++$level, " div's:\n";
    print "\t$_\n" for @{$row};
}

# print first outer level div found
print "the first div found had id '", $div[0][0], "'\n";

When run with your HTML provided, the output is:

level 1 div's:
	iwant
	nowant
level 2 div's:
	insideiwant
the first div found had id 'iwant'

Hope this helps, if it doesn't, then feel free to respond with more questions. :)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
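Associating each nested div with its parent, the step the code above leaves out, can be sketched with a stack of currently open ids. This is only a sketch, not jeffa's code: the helper name `div_parents` and the '(no id)' placeholder are made up, and it assumes a 3.x release of HTML::TokeParser::Simple, where `new(string => ...)` and `get_attr` are available.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator
use HTML::TokeParser::Simple;

# Hypothetical helper: map each div id to the id of the div that
# encloses it (undef for top-level divs).
sub div_parents {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my (@stack, %parent_of);
    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('div')) {
            # Assumption: divs without an id share one placeholder key.
            my $id = $token->get_attr('id') // '(no id)';
            $parent_of{$id} = @stack ? $stack[-1] : undef;
            push @stack, $id;          # this div is now the innermost open one
        }
        elsif ($token->is_end_tag('div')) {
            pop @stack;                # leave the innermost open div
        }
    }
    return \%parent_of;
}

my $html = <<'HTML';
<div id="iwant">
  <div id="insideiwant">
  </div>
</div>

<div id="nowant">
</div>
HTML

my $parent_of = div_parents($html);
for my $id (sort keys %$parent_of) {
    print "$id is inside ", $parent_of->{$id} // '(top level)', "\n";
}
```

The stack depth plays the same role as `$level` in the post; the only addition is remembering which id sits on top when a new div opens.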
Re: Parsing nested HTML with just regex by dragonchild (Archbishop) on Jul 23, 2003 at 15:25 UTC
Use HTML::Parser. Do not use a regex. HTML::Parser will do everything you want, it'll continue to work when your HTML changes or is malformed, and it's been tested in thousands of different situations. In addition, it's maintained for free by someone other than you (which leaves more time for you to work on your real problems, not parsing).

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Parsing nested HTML with just regex by Abigail-II (Bishop) on Jul 23, 2003 at 19:48 UTC
Well, people, including me, are always saying you shouldn't try to parse HTML with a regexp. It's not because it's impossible. It is possible. But you shouldn't do it because doing it with a regex is non-trivial.

The program below will use a regex to extract a div element with a certain id from a piece of limited HTML. I say limited, because the regex doesn't take comments into account, or CDATA declared content. It won't be able to recover from misplaced `</div>` tags either.

#!/usr/bin/perl

use strict;
use warnings;

$_ = <<'--';
<div id = "foo">
    Foo text
    <div id = "iwant">
        Text text.
        <div id = "insideiwant">
            Bla
        </div>
        <div id = "alsoinsideiwant">
            Bla bla <em>Bla</em>!
            <div id = "innerinnerdiv">
                Inner!
            </div>
        </div>
    </div>
</div>
--

my $div;
   $div = qr {<div \s+ (?:id \s* = \s* (?: "[^"]*" | '[^']*' | [-.\d]+))? \s* >
              (?: (?>[^<]+) | <(?!/?div) | (??{$div}) ) *
              </div>}ix;

my $iwant = qr {<div \s+ id \s* = \s* (?: "iwant" | 'iwant' | iwant) \s* >
                (?: (?>[^<]+) | <(?!/?div) | (??{$div}) ) *
                </div>}ix;

print $&, "\n" if /$iwant/;  # Don't try to be the smartass
                             # to point out potential issues
                             # about $&. They are irrelevant
                             # here.

__END__
<div id = "iwant">
        Text text.
        <div id = "insideiwant">
            Bla
        </div>
        <div id = "alsoinsideiwant">
            Bla bla <em>Bla</em>!
            <div id = "innerinnerdiv">
                Inner!
            </div>
        </div>
    </div>

Abigail
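For readers on Perl 5.10 or later, the same recursive idea can be written with a named group and `(?&div)` instead of `(??{$div})`. What follows is a sketch under that assumption, not Abigail's code: the helper name `extract_div_by_id` is made up, nested divs may carry any attributes, and the same limits apply (no comments, no CDATA, no recovery from misplaced `</div>` tags).

```perl
use strict;
use warnings;
use 5.010;    # (?<name>...), (?&name) and (DEFINE) need Perl 5.10 or later

# Hypothetical helper: return the whole div element whose id is $id,
# or undef if no balanced match is found.
sub extract_div_by_id {
    my ($html, $id) = @_;
    my $re = qr{
        (                                   # group 1: the element we want
            <div \s+ id \s* = \s* (?: "\Q$id\E" | '\Q$id\E' ) \s* >
            (?: (?> [^<]+ ) | < (?! /? div \b ) | (?&div) )*
            </div \s* >
        )
        (?(DEFINE)
            (?<div>                         # any balanced div element
                <div \b [^>]* >
                (?: (?> [^<]+ ) | < (?! /? div \b ) | (?&div) )*
                </div \s* >
            )
        )
    }xi;
    return $html =~ $re ? $1 : undef;
}

my $html = <<'HTML';
<div id="iwant">
  <div id="insideiwant">
  </div>
</div>

<div id="nowant">
</div>
HTML

print extract_div_by_id($html, 'iwant'), "\n";
```

The body alternation is the one from the post: a run of text, a `<` that does not start a div tag, or a complete nested div. The atomic group `(?> [^<]+ )` keeps the engine from re-splitting text runs when it backtracks.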
Re: Parsing nested HTML with just regex by Anonymous Monk on Jul 23, 2003 at 15:31 UTC
As many computer scientists know, a pure regular expression (one reducible to |, (), * and concatenation) cannot parse balanced characters. See the 'pumping lemma' in any good textbook on language theory; my class used Aho and Ullman's, but I forget the exact title/edition.

As many perl programmers know, perl regexps aren't actually pure because they can call perl code and use backreferences. My perlre page tells me that you can use `(?{code})`, but I'm not sure, I don't use that construct much.

Finally, I'd be remiss if I failed to tell you not to parse nested structures, especially complex markup, with regexps. Use a parser; you can find several on CPAN.
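What a pure regular expression lacks is a counter. A sketch of the idea, assuming only core Perl: let a simple (and genuinely regular) pattern find the div tags one at a time, and keep the nesting depth in an ordinary variable. The helper name `first_outer_div_contents` is made up, and the sketch ignores comments, CDATA and script content, so it is an illustration of the counting, not a replacement for a parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the first top-level
# <div ...> and its matching </div>, or undef if there is none.
sub first_outer_div_contents {
    my ($html) = @_;
    my ($depth, $start) = (0, undef);
    while ($html =~ m{<(/?)div\b[^>]*>}gi) {
        if ($1 eq '') {                       # opening tag
            # Remember where the contents of the first outer div begin.
            $start = $+[0] if $depth == 0 && !defined $start;
            $depth++;
        }
        else {                                # closing tag
            next if $depth == 0;              # ignore a stray </div>
            $depth--;
            # Depth back at zero: this tag closes the first outer div.
            return substr($html, $start, $-[0] - $start) if $depth == 0;
        }
    }
    return;                                   # no balanced top-level div
}

my $html = <<'HTML';
<div id="iwant">
  <div id="insideiwant">
  </div>
</div>

<div id="nowant">
</div>
HTML

print first_outer_div_contents($html);
```

`$-[0]` and `$+[0]` are the start and end offsets of the most recent match, so no text is copied until the matching close tag is found.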