Actually, there are a bunch of examples that come with the HTML::Parser module, in the "eg" directory of the distribution. Taking the code from there, here is an example of how to extract all the text from an HTML document:
#!/usr/bin/perl -w
# Extract all plain text from an HTML file
use strict;
use HTML::Parser 3.00 ();

my %inside;

sub tag
{
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";  # not for all tags
}

sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}

HTML::Parser->new(api_version => 3,
                  handlers    => [start => [\&tag, "tagname, '+1'"],
                                  end   => [\&tag, "tagname, '-1'"],
                                  text  => [\&text, "dtext"],
                                 ],
                  marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
That code is located in eg/htext. As you can see, it is event driven. The "handlers" option passed to HTML::Parser->new tells HTML::Parser which function to call for each kind of event (start tag, end tag, text). In this case, every start tag calls the function "tag" with two arguments: "tagname", which is the name of the tag, and the literal '+1', which marks it as a start tag. End tags call the same function with '-1', so %inside tracks how deeply we are nested inside each tag, and the "text" handler can skip anything inside a script or style element.
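If you would rather collect the text into a variable than print it, the same handler setup can be wrapped in a function. This is only a sketch: extract_text is a name I made up for illustration, and it uses the start_h/end_h/text_h constructor keys and parses a string instead of a file.

```perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper (not part of HTML::Parser): returns the text of an
# HTML string, skipping anything inside <script> or <style>.
sub extract_text {
    my ($html) = @_;
    my %inside;      # nesting depth per tag name
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # The argspec "tagname, '+1'" means the handler is called with
        # the tag name and the literal string '+1'.
        start_h => [ sub { $inside{ $_[0] } += $_[1] }, "tagname, '+1'" ],
        end_h   => [ sub { $inside{ $_[0] } += $_[1] }, "tagname, '-1'" ],
        # "dtext" is the text with HTML entities decoded.
        text_h  => [ sub {
                         return if $inside{script} || $inside{style};
                         $out .= $_[0];
                     }, "dtext" ],
    );
    $p->parse($html);
    $p->eof;         # signal end of document so buffered text is flushed
    return $out;
}

print extract_text('<p>Fish &amp; chips<script>var x = 1;</script></p>'), "\n";
```

Text can arrive in several chunks, which is why the handler appends to $out rather than assigning to it.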
Personally, I have had more luck with HTML::TokeParser, though I'm sure that isn't the case for everyone. I find HTML::TokeParser a bit more intuitive for this sort of job, but perhaps that is just the way I think... or maybe I just wasn't using HTML::Parser right ;-) In any case, good luck.
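For comparison, here is a rough sketch of the HTML::TokeParser style, where you pull tokens out of the document yourself instead of registering callbacks. The function name extract_links and the sample HTML are made up for illustration; get_tag and get_trimmed_text are the module's own methods.

```perl
use strict;
use warnings;
use HTML::TokeParser ();

# Hypothetical helper: returns a list of [href, link text] pairs
# for every <a href="..."> in the given HTML string.
sub extract_links {
    my ($html) = @_;
    # A reference to a scalar is parsed as the literal document.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!\n";
    my @links;
    # get_tag('a') skips ahead to the next <a> start tag and returns
    # [tagname, attr hash, attr order, original text], or undef at the end.
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;
        # Text up to the closing </a>, with whitespace collapsed and trimmed.
        my $text = $p->get_trimmed_text('/a');
        push @links, [ $href, $text ];
    }
    return @links;
}

for my $link (extract_links('<p><a href="/faq"> The  FAQ </a> and <a name="x">no href</a></p>')) {
    print "$link->[0]\t$link->[1]\n";
}
```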
-Eric
I wrote some code that's around the Monastery:
#!/usr/local/bin/perl
use strict;
use Data::Dumper;
use HTML::TreeBuilder;

die "must input filename" unless @ARGV;

foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    print "Hey, here's a dump of the parse tree of $file_name:\n";
    # $tree->dump; # a method we inherit from HTML::Element

    # The first five <table> elements found in the document, in order
    my %table;
    (
        $table{root},
        $table{cond},
        $table{'cond-alternatives'},
        $table{action},
        $table{'action-entries'}
    )
        = $tree->find_by_tag_name('table');

    # All <td> elements under each of those tables
    my %td;
    map { $td{$_} = [ $table{$_}->find_by_tag_name('td') ] } (keys %table);

    # The content of every <td>, grouped by table
    my %x;
    map {
        my $field = $_;
        map { push @{$x{$field}}, $_->content_array_ref } @{$td{$field}}
    } (keys %td);

    printf "cond-alt has %s", Dumper $x{'cond-alternatives'};

    # Now that we're done with it, we must destroy it.
    $tree = $tree->delete;
}
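#!/usr/bin/perl -w
use strict;
use HTML::Parser;
use LWP::Simple;

# Called for every start tag: print the tag name and its attributes
sub start {
    my ($tag, $attr, $attrseq) = @_;
    print "Found $tag\n";
    foreach (@$attrseq) {
        print "  [$_ -> $attr->{$_}]\n";
    }
}

my $h = HTML::Parser->new(start_h => [\&start, 'tagname, attr, attrseq']);
my $url = shift or die "Usage: $0 URL\n";
my $page = get($url);
die "Couldn't fetch $url\n" unless defined $page;
$h->parse($page);
$h->eof;  # signal end of document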
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
perldoc HTML::Parser
or the thread Who has used HTML::Parser??
for some samples of usage
I'd be after an example because the documentation isn't really clear that it can do what I want...
Unfortunately, the other node to which you refer in turn refers to version 2 of the parser (and I have 3, which, I believe, works differently) and to TPJ, which is, er, closed...
--
RatArsed