Re: HTML::Parser API script or module

This uses HTML::TokeParser to pull the links found between a start marker and an end marker:

#!/bin/perl5

use strict;
use warnings;
use HTML::TokeParser;

my $file = 'map2004.html';
my $tp = HTML::TokeParser->new($file)
  or die "Couldn't read html file: $!";

# start tag, attrib, value  
my ($s_tag, $s_attrb, $s_value) = qw(div class menu);
# stop marker: the first start tag with this name ends the scan
my ($e_tag) = 'h6';
my $max = 20;

my $count;

my $start;  # flag
# typo fixed
my %data;  # hash to hold output

while ( my $tag = $tp->get_token ) {
  
  next if
    $tag->[0] eq 'S' and
    $tag->[1] eq $s_tag and 
    exists $tag->[2]->{$s_attrb} and 
    $tag->[2]->{$s_attrb} eq $s_value and
    ++$start;
    
  next unless $start;
  
  last if
    $tag->[0] eq 'S' and
    $tag->[1] eq $e_tag;
    
  if (
    $tag->[0] eq 'S' and
    $tag->[1] eq 'a' and
    exists $tag->[2]->{href}
  ){
      my $href      = $tag->[2]->{href};
      my $link_text = $tp->get_trimmed_text('/a');
      $data{$href}  = $link_text;
      $count++;
      last if $count == $max;
  }
}
for my $key (sort keys %data){
  print "$key -> $data{$key}\n";
}

# ["S",  $tag, $attr, $attrseq, $text]
# ["E",  $tag, $text]
# ["T",  $text, $is_data]
# ["C",  $text]
# ["D",  $text]
# ["PI", $token0, $text]
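A quick way to see those token layouts for yourself is to dump the first two slots of each token. This is only a minimal sketch: the sample markup is made up, and it is passed as a scalar reference so no file is needed.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# made-up sample markup; a scalar ref is parsed as a literal document
my $html = '<p class="x">Hi</p>';
my $tp = HTML::TokeParser->new(\$html)
  or die "Couldn't parse sample: $!";

my @seen;
while ( my $token = $tp->get_token ) {
  # slot 0 is the token type; slot 1 is the tag name for S and E
  # tokens and the text itself for T tokens
  push @seen, "$token->[0]:$token->[1]";
}
print "$_\n" for @seen;
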

update

"...specify in some simple way what data to extract..."

That's trickier because it depends on 'what data'.
I've always found it relatively easy to adapt the above type of script.
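As a rough illustration of that kind of adaptation, the same loop can be wrapped in a sub that takes the markers as arguments. This is a sketch under assumptions, not a general answer: extract_links and its option names (tag, attr, value, stop, max) are hypothetical, and the sample document is made up.

use strict;
use warnings;
use HTML::TokeParser;

# hypothetical helper: collect href => link text between a start marker
# (tag with attr = value) and the first start tag named $opt{stop}
sub extract_links {
  my ($source, %opt) = @_;
  my $tp = HTML::TokeParser->new($source)
    or die "Couldn't read html source: $!";

  my ($start, $count, %data) = (0, 0);
  while ( my $tag = $tp->get_token ) {
    next unless $tag->[0] eq 'S';   # only start tags matter here
    my ($name, $attr) = @{$tag}[1, 2];

    if ( !$start ) {
      # wait for the start marker before collecting anything
      $start = 1
        if $name eq $opt{tag}
        and exists $attr->{ $opt{attr} }
        and $attr->{ $opt{attr} } eq $opt{value};
      next;
    }
    last if $name eq $opt{stop};

    if ( $name eq 'a' and exists $attr->{href} ) {
      $data{ $attr->{href} } = $tp->get_trimmed_text('/a');
      last if ++$count == $opt{max};
    }
  }
  return \%data;
}

# made-up sample document, passed by reference instead of a filename
my $html = <<'HTML';
<div class="menu">
<a href="/one">First  link</a>
<a href="/two">Second</a>
</div>
<h6>footer</h6>
<a href="/three">Ignored</a>
HTML

my $links = extract_links(\$html,
  tag => 'div', attr => 'class', value => 'menu', stop => 'h6', max => 20);
print "$_ -> $links->{$_}\n" for sort keys %$links;

With the sample above it should list the two links inside the menu div and skip /three, which comes after the h6. Anything beyond links (table cells, image sources and so on) still needs its own test in the loop, which is the 'what data' problem again.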

update 2

Fixed typo.


Replies are listed 'Best First'.

Re^2: HTML::Parser API script or module
by Ovid (Cardinal) on Jun 04, 2005 at 20:49 UTC

Untested, but you can simplify that while loop and make it easier to read by switching to HTML::TokeParser::Simple:

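# $tp must now come from HTML::TokeParser::Simple->new($file)
while ( my $tag = $tp->get_token ) {
  # note the start marker, then move on to the next token
  next if
    $tag->is_start_tag($s_tag) 
      and 
    ($tag->get_attr($s_attrb) || '') eq $s_value
      and
    ++$start;
  next unless $start;
  # the original stops at the opening $e_tag, so test for a start tag
  last if $tag->is_start_tag($e_tag);
   
  if ($tag->is_start_tag('a') && $tag->get_attr('href')) {
      # get_trimmed_text is a parser method, so call it on $tp
      $data{$tag->get_attr('href')} = $tp->get_trimmed_text('/a');
      $count++;
      last if $count == $max;
  }
}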

I may have missed some particulars, but you can see how the code is easier to read.
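For anyone who wants to try the idea on its own, here is a small self-contained sketch. It is not the full script: the sample markup is made up, and the start/stop markers are left out to keep the focus on the token methods.

use strict;
use warnings;
use HTML::TokeParser::Simple;

# made-up sample markup, passed as a scalar reference
my $html = '<div class="menu"><a href="/one"> First </a><a name="x">skip</a></div>';
my $tp = HTML::TokeParser::Simple->new(\$html)
  or die "Couldn't parse sample: $!";

my %data;
while ( my $tag = $tp->get_token ) {
  # tokens are objects, so the array indexes become method calls
  next unless $tag->is_start_tag('a') and $tag->get_attr('href');
  # reading the link text is still done through the parser object
  $data{ $tag->get_attr('href') } = $tp->get_trimmed_text('/a');
}
print "$_ -> $data{$_}\n" for sort keys %data;
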

Cheers,
Ovid

New address of my CGI Course.
