in reply to nested tag matching
First of all, if you're a novice, you should put the following at the top of all your code:
use strict; use warnings;
This will help catch mistakes early. Second, why aren't you using a module from CPAN, such as HTML::TreeBuilder, to parse the HTML? You should never mess around with regular expressions on HTML. The original SGML specifications from which HTML is derived are pretty loose, which means that for every rule there are half a dozen exceptions (or more!) that render fine in most browsers even though they are a pain to parse. Using a parser will not only make your code more robust, it will also make it much more intuitive to read, e.g.:
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $HTML_to_parse = shift (@ARGV);
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($HTML_to_parse);
    $tree->eof;

    my @paragraph_tags = $tree->look_down('_tag', 'p');
    foreach my $p (@paragraph_tags) {
        # note that this variable will "hide" the other
        # copy of @paragraph_tags and be garbage collected
        # as soon as it goes out of scope (the end of the
        # foreach loop)
        my @paragraph_tags = $p->look_down('_tag', 'p');
        if (scalar (@paragraph_tags) == 1) {
            my $tag = shift (@paragraph_tags);
            my @contents = $tag->content_list;
            my $content = "";
            foreach my $con (@contents) {
                # check that we have text and not an object
                $content .= $con unless (ref $con);
            }
            print $content;
        }
    }
Just to give you an idea of why using regular expressions to parse HTML is a bad idea, look at this:
<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>
Your original regular expressions make no allowance for the class="" attribute, so your code would break on any page whose tags carry attributes. HTML::TreeBuilder would take it in stride and let you access the attributes if you ever needed them, using: my %attr = $node->all_external_attr;. So again, don't reinvent the wheel if you don't have to.
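To make the attribute point concrete, here is a minimal sketch that feeds the sample markup above to HTML::TreeBuilder and reads the class of each paragraph. The variable names ($html, @classes, $bar, $bar_text) are just for illustration, and it assumes HTML::TreeBuilder is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the example above (illustrative input only).
my $html = '<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Collect the class attribute of every <p>, in document order.
my @classes;
foreach my $p ($tree->look_down('_tag', 'p')) {
    # all_external_attr returns name/value pairs, skipping internal
    # attributes such as _tag and _parent.
    my %attr = $p->all_external_attr;
    push @classes, defined $attr{class} ? $attr{class} : '(none)';
}
print "class: $_\n" for @classes;

# Attribute criteria can also be passed straight to look_down;
# in scalar context it returns the first matching element.
my $bar = $tree->look_down('_tag', 'p', 'class', 'bar');
my $bar_text = $bar ? $bar->as_text : '';
print "bar text: $bar_text\n";

# Free the tree explicitly once we are done with it.
$tree->delete;
```

Because the parser hands you the attributes as a plain hash, adding an id="" or style="" to the tags would not require touching the matching logic at all.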
Re: Re: nested tag matching
by Anonymous Monk on Feb 06, 2004 at 11:58 UTC
by Vautrin (Hermit) on Feb 06, 2004 at 14:12 UTC