ldln has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I'm wondering whether there is any script or module that runs on top of HTML::Parser (or similar modules) and lets us specify what to look for when parsing?

For example, we might want a start trigger: a tag <x> with an attribute named <y> whose value matches /<z>/ (a regexp), to indicate where to start looking for data in the HTML file.
We could then specify in some simple way what data to extract, for example: a,href,src;t;100
to get the next 100 URLs and their link text in the document. And possibly an end trigger, e.g. if we see the text <t>, stop parsing.

Does anything like this exist?

Replies are listed 'Best First'.
Re: HTML::Parser API script or module
by wfsp (Abbot) on Jun 04, 2005 at 18:16 UTC
    This uses HTML::TokeParser
    #!/bin/perl5
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $file = 'map2004.html';
    my $tp = HTML::TokeParser->new($file)
      or die "Couldn't read html file: $!";

    # start tag, attrib, value
    my ($s_tag, $s_attrb, $s_value) = qw(div class menu);
    # end tag
    my ($e_tag) = 'h6';
    my $max = 20;
    my $count;
    my $start; # flag # typo fixed
    my %data;  # hash to hold output

    while ( my $tag = $tp->get_token ) {
        next if $tag->[0] eq 'S'
            and $tag->[1] eq $s_tag
            and exists $tag->[2]->{$s_attrb}
            and $tag->[2]->{$s_attrb} eq $s_value
            and ++$start;
        next unless $start;
        last if $tag->[0] eq 'S' and $tag->[1] eq $e_tag;
        if (    $tag->[0] eq 'S'
            and $tag->[1] eq 'a'
            and exists $tag->[2]->{href} )
        {
            my $href      = $tag->[2]->{href};
            my $link_text = $tp->get_trimmed_text('/a');
            $data{$href} = $link_text;
            $count++;
            last if $count == $max;
        }
    }

    for my $key (sort keys %data){
        print "$key -> $data{$key}\n";
    }

    # ["S", $tag, $attr, $attrseq, $text]
    # ["E", $tag, $text]
    # ["T", $text, $is_data]
    # ["C", $text]
    # ["D", $text]
    # ["PI", $token0, $text]

    update

    "...specify in some simple way what data to extract..."
    That's trickier because it depends on 'what data'.
    I've always found it relatively easy to adapt the above type of script.
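
    One way to make the 'what data' part configurable is to parse a small spec string up front. This is only a minimal sketch, assuming one possible reading of the 'a,href,src;t;100' example: the first field is a tag followed by the attributes wanted from it, the second is a flag ('t' for link text), and the third is a maximum count. parse_spec and its hash keys are made up for illustration; they are not part of any module.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split a spec such as 'a,href,src;t;100' into a hash.
    # The field layout (tag,attrs;flags;max) is an assumed reading of the example.
    sub parse_spec {
        my ($spec) = @_;
        my ( $fields, $flags, $max ) = split /;/, $spec;
        my ( $tag, @attrs ) = split /,/, $fields;
        return {
            tag       => $tag,
            attrs     => \@attrs,
            want_text => ( defined $flags && $flags =~ /t/ ) ? 1 : 0,
            max       => defined $max ? $max : 0,    # 0 means "no limit"
        };
    }

    my $spec = parse_spec('a,href,src;t;100');
    print "tag=$spec->{tag} attrs=@{ $spec->{attrs} } ",
          "text=$spec->{want_text} max=$spec->{max}\n";
    ```

    The token loop could then compare $tag->[1] against $spec->{tag}, pick out the attributes listed in $spec->{attrs}, and stop at $spec->{max}.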

    update 2

    Fixed typo.

      Untested, but you can simplify that while loop and make it easier to read by switching to HTML::TokeParser::Simple:

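      # $tp must be an HTML::TokeParser::Simple object here
      while ( my $tag = $tp->get_token ) {
          # start trigger: set the flag, then move on to the next token
          if (    $tag->is_start_tag($s_tag)
              and ( $tag->get_attr($s_attrb) || '' ) eq $s_value )
          {
              $start++;
              next;
          }
          next unless $start;
          last if $tag->is_start_tag($e_tag);
          if ( $tag->is_start_tag('a') && $tag->get_attr('href') ) {
              # get_trimmed_text is a parser method, so call it on $tp
              $data{ $tag->get_attr('href') } = $tp->get_trimmed_text('/a');
              $count++;
              last if $count == $max;
          }
      }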

      I may have missed some particulars, but you can see how the code is easier to read.

      Cheers,
      Ovid


Re: HTML::Parser API script or module
by skillet-thief (Friar) on Jun 04, 2005 at 20:07 UTC

    I would take a serious look at HTML::Tree (which is composed of HTML::TreeBuilder and HTML::Element). It runs on top of HTML::Parser and has lots of great functions that can do exactly what you are looking for, as well as lots of other tree-walking and manipulating magic.

    For example, you could do something like this (untested):

    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new_from_file("myfile.html");
    my $html_object = $tree->look_down(
        "_tag", "x",     # find an x element
        "y",    qr/z/,   # that has a y attribute matching z
    );
    IIRC, you can do the same thing in list context and get a list of all the matching elements in the file, or in a particular branch of the tree.

    Then you can pull whatever you want (e.g. other attributes) out of your $html_object, or see it as text:

    my $text = $html_object->as_text;
    or as raw HTML:
    my $html = $html_object->as_HTML;
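
    To tie this back to the original question (links below a trigger element), here is a small self-contained sketch. It assumes HTML::TreeBuilder is installed; the sample markup and the div/class/menu trigger are made up for the example.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Sample markup, made up for this sketch.
    my $html = q{<html><body>
      <div class="menu"><a href="/one">First</a> <a href="/two">Second</a></div>
      <p><a href="/three">Third</a></p>
    </body></html>};

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Scalar context: the first div whose class attribute matches /menu/.
    my $menu = $tree->look_down( _tag => 'div', class => qr/menu/ );

    # List context, called on that element: every <a> below it with an href.
    my @links = $menu->look_down( _tag => 'a', href => qr/./ );

    # Map each href to its link text.
    my %data = map { $_->attr('href') => $_->as_text } @links;
    print "$_ -> $data{$_}\n" for sort keys %data;

    $tree->delete;    # free the tree explicitly once we are done with it
    ```

    Calling look_down on $menu rather than on $tree is what restricts the search to that branch, so the link in the paragraph is not picked up.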

    There is a bit of a learning curve (or at least there was for me), in that you have to get used to thinking of your document as a tree, and not as text per se. But once you get it, you can do a lot of things very cleanly.

Re: HTML::Parser API script or module
by Corion (Patriarch) on Jun 04, 2005 at 18:15 UTC

    You are looking for XPath. XML::XPath sits on top of XML::Parser, and XML::LibXML has XPath support built in.

    XPath expressions are a W3C standard for queries against XML in a regex-like syntax:

    //a[@href]        # select all a tags with a href= attribute
    //table/tr/td/a   # select all a tags directly inside a table cell
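
    A rough sketch of running such queries from Perl, assuming XML::LibXML is installed and the input is well-formed (real-world HTML often is not, and XML::LibXML has a separate load_html for that case). The sample markup is made up for the example.

    ```perl
    use strict;
    use warnings;
    use XML::LibXML;

    # Well-formed sample markup, made up for this sketch.
    my $xml = q{<html><body>
      <a href="/one">First</a>
      <a name="anchor">No link</a>
      <table><tr><td><a href="/two">Second</a></td></tr></table>
    </body></html>};

    my $doc = XML::LibXML->load_xml( string => $xml );

    # All <a> elements that carry an href attribute.
    my @with_href = $doc->findnodes('//a[@href]');

    # Only <a> elements sitting directly in a table cell.
    my @in_table = $doc->findnodes('//table/tr/td/a');

    printf "%s -> %s\n", $_->getAttribute('href'), $_->textContent
        for @with_href;
    printf "in table: %s\n", $_->textContent for @in_table;
    ```

    A start trigger like the one in the question can be folded into the path itself, e.g. by starting the expression at the trigger element and selecting the links below it.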