HTML::Parser API script or module

ldln has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: HTML::Parser API script or module by wfsp (Abbot) on Jun 04, 2005 at 18:16 UTC
This uses HTML::TokeParser

```perl
#!/bin/perl5
use strict;
use warnings;
use HTML::TokeParser;

my $file = 'map2004.html';
my $tp = HTML::TokeParser->new($file)
    or die "Couldn't read html file: $!";

# start tag, attrib, value
my ($s_tag, $s_attrb, $s_value) = qw(div class menu);
# end tag
my ($e_tag) = 'h6';
my $max = 20;
my $count;
my $start;    # flag # typo fixed
my %data;     # hash to hold output

while ( my $tag = $tp->get_token ) {
    # skip everything up to the opening <div class="menu">
    next if $tag->[0] eq 'S'
        and $tag->[1] eq $s_tag
        and exists $tag->[2]->{$s_attrb}
        and $tag->[2]->{$s_attrb} eq $s_value
        and ++$start;
    next unless $start;

    # stop at the first <h6>
    last if $tag->[0] eq 'S' and $tag->[1] eq $e_tag;

    if ( $tag->[0] eq 'S' and $tag->[1] eq 'a' and exists $tag->[2]->{href} ) {
        my $href      = $tag->[2]->{href};
        my $link_text = $tp->get_trimmed_text('/a');
        $data{$href} = $link_text;
        $count++;
        last if $count == $max;
    }
}

for my $key (sort keys %data) {
    print "$key -> $data{$key}\n";
}

# Token formats returned by get_token:
# ["S", $tag, $attr, $attrseq, $text]
# ["E", $tag, $text]
# ["T", $text, $is_data]
# ["C", $text]
# ["D", $text]
# ["PI", $token0, $text]
```

Update: "...specify in some simple way what data to extract..." That's trickier because it depends on 'what data'. I've always found it relatively easy to adapt the above type of script.

Update 2: Fixed typo.
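To make the token shapes listed in the comments at the end of that script more concrete, here is a minimal sketch that walks a tiny document and reports each token type. The sample markup and the `@seen` array are made up for illustration; the only library calls used are `HTML::TokeParser->new` with a scalar reference (which parses the string itself) and `get_token`.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document, parsed from a string rather than a file.
my $html = '<div class="menu"><a href="/a.html">First</a><!-- note --></div>';

# A reference to a scalar tells HTML::TokeParser to parse the string itself.
my $tp = HTML::TokeParser->new(\$html)
    or die "Couldn't parse sample document";

my @seen;    # token types, in document order (illustrative helper)
while ( my $token = $tp->get_token ) {
    my $type = $token->[0];
    push @seen, $type;

    if    ( $type eq 'S' ) { print "start   <$token->[1]>\n" }
    elsif ( $type eq 'E' ) { print "end     </$token->[1]>\n" }
    elsif ( $type eq 'T' ) { print "text    '$token->[1]'\n" }
    elsif ( $type eq 'C' ) { print "comment $token->[1]\n" }
}
```

For start tags, the attribute hash is in `$token->[2]`, which is why the script above tests `$tag->[2]->{href}`.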
Re^2: HTML::Parser API script or module by Ovid (Cardinal) on Jun 04, 2005 at 20:49 UTC
Untested, but you can simplify that while loop and make it easier to read by switching to HTML::TokeParser::Simple:

```perl
# $tp must be an HTML::TokeParser::Simple object for the token methods to exist
while ( my $tag = $tp->get_token ) {
    # skip ahead to the opening <div class="menu"> and raise the flag there
    next if $tag->is_start_tag($s_tag)
        and ( $tag->get_attr($s_attrb) || '' ) eq $s_value
        and ++$start;
    next unless $start;
    last if $tag->is_start_tag($e_tag);

    if ( $tag->is_start_tag('a') && $tag->get_attr('href') ) {
        # get_trimmed_text is a parser method, so it is called on $tp
        $data{ $tag->get_attr('href') } = $tp->get_trimmed_text('/a');
        $count++;
        last if $count == $max;
    }
}
```

I may have missed some particulars, but you can see how the code is easier to read.

Cheers,
Ovid
Re: HTML::Parser API script or module by skillet-thief (Friar) on Jun 04, 2005 at 20:07 UTC
I would take a serious look at HTML::Tree (which is composed of HTML::TreeBuilder and HTML::Element). It runs on top of HTML::Parser and has lots of great functions that can do exactly what you are looking for, as well as lots of other tree-walking and manipulating magic. For example, you could do something like this (untested):

```perl
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file("myfile.html");
my $html_object = $tree->look_down(
    "_tag", "x",      # find an x element
    "y",    qr/z/,    # that has a y attribute matching z
);
```

IIRC, you can do the same thing in list context and get a list of all the matching `<x>` elements in the file, or in a particular subtree.

Then you can pull whatever you want (e.g. other attributes) out of your `$html_object`, see it as text: `my $text = $html_object->as_text;` or as raw HTML: `my $html = $html_object->as_HTML;`

There is a bit of a learning curve (or at least there was for me), in that you have to get used to thinking of your document as a tree, and not as text per se. But once you get it, you can do a lot of things very cleanly.
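As a rough illustration of `look_down` in scalar versus list context, here is a small sketch that collects links from inside one div. The sample markup and the variable names are invented for the example; `new_from_content` is used only so the snippet does not depend on a file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; a real script would use new_from_file instead.
my $html = <<'HTML';
<html><body>
<div class="menu">
  <a href="/one.html">One</a>
  <a href="/two.html">Two</a>
</div>
<a href="/outside.html">Outside</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Scalar context: the first matching element only.
my $menu = $tree->look_down( _tag => 'div', class => 'menu' )
    or die "No menu div found";

# List context: every <a> with a non-empty href below the menu div.
my @links = $menu->look_down( _tag => 'a', href => qr/./ );

# Map each href to its link text.
my %data = map { $_->attr('href') => $_->as_text } @links;
print "$_ -> $data{$_}\n" for sort keys %data;

$tree->delete;    # free the tree explicitly (needed on older HTML::Tree releases)
```

Calling `look_down` on `$menu` rather than `$tree` restricts the search to that subtree, which is why the link outside the div is not picked up.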
Re: HTML::Parser API script or module by Corion (Patriarch) on Jun 04, 2005 at 18:15 UTC
You are looking for XPath, available through XML::XPath (which sits on top of XML::Parser) or XML::LibXML. XPath expressions are a W3C standard for queries against XML in a compact, path-like syntax:

```
//a[@href]         select all a tags with a href= attribute
//table/tr/td/a    select all a tags directly inside a table cell
```
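Here is a minimal sketch of running such queries with XML::XPath, using the predicate form `//a[@href]`. The sample document is hypothetical, and it is deliberately well-formed XHTML: XML::XPath relies on an XML parser, so it is an assumption of this approach that the input is valid XML rather than ordinary tag-soup HTML.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical sample; it has to be well-formed XML for the parser to accept it.
my $xhtml = <<'XML';
<html><body>
<table><tr><td><a href="/in-table.html">In table</a></td></tr></table>
<p><a href="/in-para.html">In paragraph</a> <a name="anchor">No href</a></p>
</body></html>
XML

my $xp = XML::XPath->new( xml => $xhtml );

# In list context findnodes returns the matching nodes as a plain list.
# All <a> elements that carry an href attribute:
my @with_href = $xp->findnodes('//a[@href]');

# Only <a> elements sitting directly in a table cell:
my @in_table = $xp->findnodes('//table/tr/td/a');

for my $node (@with_href) {
    printf "%s -> %s\n", $node->getAttribute('href'), $node->string_value;
}
```

The `<a name="anchor">` element is skipped by the first query because the `[@href]` predicate requires the attribute to be present.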