in reply to HTML::Parser API script or module

This uses HTML::TokeParser:

#!/bin/perl5
use strict;
use warnings;
use HTML::TokeParser;

my $file = 'map2004.html';
my $tp = HTML::TokeParser->new($file)
    or die "Couldn't read html file: $!";

# start tag, attrib, value
my ($s_tag, $s_attrb, $s_value) = qw(div class menu);
# end tag
my ($e_tag) = 'h6';
my $max = 20;
my $count;
my $start; # flag # typo fixed
my %data;  # hash to hold output

while ( my $tag = $tp->get_token ) {
    next if $tag->[0] eq 'S'
        and $tag->[1] eq $s_tag
        and exists $tag->[2]->{$s_attrb}
        and $tag->[2]->{$s_attrb} eq $s_value
        and ++$start;
    next unless $start;
    last if $tag->[0] eq 'S' and $tag->[1] eq $e_tag;
    if ( $tag->[0] eq 'S'
        and $tag->[1] eq 'a'
        and exists $tag->[2]->{href} ){
        my $href = $tag->[2]->{href};
        my $link_text = $tp->get_trimmed_text('/a');
        $data{$href} = $link_text;
        $count++;
        last if $count == $max;
    }
}

for my $key (sort keys %data){
    print "$key -> $data{$key}\n";
}

# ["S", $tag, $attr, $attrseq, $text]
# ["E", $tag, $text]
# ["T", $text, $is_data]
# ["C", $text]
# ["D", $text]
# ["PI", $token0, $text]

update

"...specify in some simple way what data to extract..."
That's trickier because it depends on what data you want.
I've always found it relatively easy to adapt a script like the one above.
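
As a rough illustration of that kind of adaptation, here is a minimal sketch that wraps the same token loop in a sub, so the start marker, end tag and limit are passed in rather than hard-coded. The name extract_links and its source/start/end/max options are made up for this example and are not part of HTML::TokeParser. It parses an inline string (passed as a scalar reference) so it can be run without map2004.html.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): collect href => link text
# for every <a href> found after the start marker and before the end tag.
sub extract_links {
    my (%opt) = @_;
    my $tp = HTML::TokeParser->new( $opt{source} )
        or die "Couldn't read html source: $!";
    my ( $s_tag, $s_attrb, $s_value ) = @{ $opt{start} };
    my $max = $opt{max} || 20;
    my ( %data, $started );
    my $count = 0;

    while ( my $tag = $tp->get_token ) {
        next unless $tag->[0] eq 'S';    # only start tags matter here
        if ( !$started ) {
            # skip everything until the start marker has been seen
            $started = 1
                if $tag->[1] eq $s_tag
                and ( $tag->[2]{$s_attrb} || '' ) eq $s_value;
            next;
        }
        last if $tag->[1] eq $opt{end};
        next unless $tag->[1] eq 'a' and exists $tag->[2]{href};
        $data{ $tag->[2]{href} } = $tp->get_trimmed_text('/a');
        last if ++$count == $max;
    }
    return \%data;
}

# Inline sample document, an assumption standing in for a real file.
my $html = <<'HTML';
<a href="/skipped">before the menu</a>
<div class="menu">
  <a href="/one">First</a>
  <a href="/two"> Second </a>
</div>
<h6>Footer</h6>
<a href="/late">after the end tag</a>
HTML

my $links = extract_links(
    source => \$html,               # a filename works here too
    start  => [qw(div class menu)],
    end    => 'h6',
    max    => 20,
);
print "$_ -> $links->{$_}\n" for sort keys %$links;
```

With the sample above, only the two links inside the menu div should be reported. Anything beyond "links between two markers" still needs its own loop body, which is why a fully generic "what data" specification is hard.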

update 2

Fixed typo.

Re^2: HTML::Parser API script or module
by Ovid (Cardinal) on Jun 04, 2005 at 20:49 UTC

    Untested, but you can simplify that while loop and make it easier to read by switching to HTML::TokeParser::Simple:

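    # assumes $tp was created with HTML::TokeParser::Simple->new($file)
    while ( my $tag = $tp->get_token ) {
        # the start marker sets the flag; everything before it is skipped
        next if $tag->is_start_tag($s_tag)
            and ($tag->get_attr($s_attrb) || '') eq $s_value
            and ++$start;
        next unless $start;
        # stop at the opening end-marker tag, as the original script does
        last if $tag->is_start_tag($e_tag);
        if ($tag->is_start_tag('a') && $tag->get_attr('href')) {
            # get_trimmed_text is a parser method, not a token method
            $data{$tag->get_attr('href')} = $tp->get_trimmed_text('/a');
            $count++;
            last if $count == $max;
        }
    }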

    I may have missed some particulars, but you can see how the code is easier to read.

    Cheers,
    Ovid
