isync has asked for the wisdom of the Perl Monks concerning the following question:

As in this post, I use HTML::Parser to extract the content of <a href> tags.
$p->handler( start => \&a_start_handler, "tagname,self,attr" );
$p->unbroken_text( 1 );
$p->parse( $content ) || die $!;

foreach my $link ( @linklist ) {
    print $link->[0];    # link
    print $link->[1];    # text
}

sub a_start_handler {
    my( $tag, $self, $attr ) = @_;

    # we only act on <a> tags
    return if $tag ne "a";

    if( defined( $href = $attr->{href} ) ) {
        $self->handler( text => sub { $text = shift; $text =~ s/\n//g; }, "dtext" );
        $self->handler( end  => \&a_end_handler, "tagname,self" );
    }

    foreach my $key ( keys %$attr ) {
        # print ">$key=$attr->{$key}\n";
    }
}

sub a_end_handler {
    return if shift ne "a";
    my $self = shift;

    push @linklist, [ $href, $text ] if defined $text && $text !~ /^\s*$/;

    $self->handler( end  => undef );
    $self->handler( text => undef );
}

And, as it should, it properly strips all HTML from it. But now I also need to get the markup contained in the link. However, changing this line
$self->handler( text => sub { $text = shift; $text =~ s/\n//g; }, "dtext" );
to
$self->handler( text => sub { $text = shift; $text =~ s/\n//g; }, "text" );
does not give the expected result (switching from getting dtext to getting text).
Any hints? Am I on the wrong track here (admitting that the parser interface is a bit hard to understand)?
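One way to capture the markup inside the link with plain HTML::Parser is to stop relying on the text handler altogether and instead accumulate the raw source of every event between <a> and </a>, using the "text" argspec (which hands a handler the source text of the event, tags included). This is a minimal sketch under that assumption, restructured from the code above; nested <a> elements are not handled:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @linklist;
my ( $href, $inner );

sub a_start_handler {
    my ( $tag, $self, $attr ) = @_;
    return if $tag ne 'a';
    return unless defined $attr->{href};

    $href  = $attr->{href};
    $inner = '';

    # Inside the link: append the raw source ("text" argspec) of every
    # event, so nested markup such as <b>...</b> survives verbatim.
    $self->handler( start => sub { $inner .= $_[0] }, 'text' );
    $self->handler( text  => sub { $inner .= $_[0] }, 'text' );
    $self->handler( end   => \&a_end_handler, 'tagname,self,text' );
}

sub a_end_handler {
    my ( $tag, $self, $raw ) = @_;
    if ( $tag ne 'a' ) {    # a nested close tag like </b>
        $inner .= $raw;
        return;
    }

    $inner =~ s/\n//g;
    push @linklist, [ $href, $inner ] if $inner !~ /^\s*$/;

    # restore the outer handlers for the next link
    $self->handler( start => \&a_start_handler, 'tagname,self,attr' );
    $self->handler( text  => undef );
    $self->handler( end   => undef );
}

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&a_start_handler, 'tagname,self,attr' );
$p->unbroken_text( 1 );
$p->parse('<p><a href="x.html">click <b>here</b></a></p>');
$p->eof;

print "$_->[0] => $_->[1]\n" for @linklist;
# prints: x.html => click <b>here</b>
```

The point of the swap is that start and end tags are separate parser events that never reach the text handler, so no argspec on the text handler alone can deliver them.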

Replies are listed 'Best First'.
Re: HTML::Parser to extract link text?
by Juerd (Abbot) on Jun 19, 2007 at 18:13 UTC

    I wouldn't use HTML::Parser if I wanted just parts of the document. H::P is nice if you want to iteratively go through an entire page, but maintaining state quickly becomes boring and error-prone.

    HTML::TreeBuilder, which is based on HTML::Parser, is easier to use.

    use strict;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file("test.html");

    my $content_as_html = sub {
        join "", map { ref($_) ? $_->as_HTML : $_ } shift->content_list;
    };

    for my $element ( $tree->look_down( _tag => "a", href => qr/./ ) ) {
        my $content = $element->$content_as_html;
        my $href    = $element->attr("href");
        $content =~ s/\n//g;
        print ">> $href, $content\n";
    }

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Thank you for your opinion.

      I made quite an effort to benchmark various versions against each other, and now that I am done I wouldn't like to go back and do it again.

      My results were that HTML::Parser is the fastest solution, beating HTML::LinkExtor and HTML::LinkExtractor. My guess is that it is so fast because both link-specific modules are built on top of HTML::Parser, so using the underlying library directly is faster still.
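A rough harness for this kind of comparison can be built on the core Benchmark module; the sketch below is illustrative (the document size and iteration count are made up for the example), counting <a href> tags with both HTML::Parser and HTML::LinkExtor:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::Parser;
use HTML::LinkExtor;

# a synthetic document with a known number of links
my $html = '<p>' . ( '<a href="x.html">link</a> text ' x 200 ) . '</p>';

sub count_with_parser {
    my ($doc) = @_;
    my $n = 0;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $attr ) = @_;
                $n++ if $tag eq 'a' && defined $attr->{href};
            },
            'tagname,attr'
        ],
    );
    $p->parse($doc);
    $p->eof;
    return $n;
}

sub count_with_linkextor {
    my ($doc) = @_;
    my $n = 0;
    my $p = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        $n++ if $tag eq 'a' && defined $attr{href};
    } );
    $p->parse($doc);
    $p->eof;
    return $n;
}

# compare 100 runs of each; cmpthese prints a rate table
cmpthese( 100, {
    'HTML::Parser'    => sub { count_with_parser($html) },
    'HTML::LinkExtor' => sub { count_with_linkextor($html) },
} );
```

Whatever the numbers come out to, it is worth checking that both variants extract the same set of links before trusting the timing comparison.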

      Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?

      So, back to the original question: any comments on how to get HTML::Parser to deliver text instead of dtext?
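For what it's worth, the difference between the text and dtext argspecs is only entity decoding of the text event itself; neither one ever includes tags, because start and end tags are separate events that never reach the text handler. A minimal sketch (hypothetical snippet, not from the thread) showing both argspecs on the same event:

```perl
use strict;
use warnings;
use HTML::Parser;

my ( $raw, $decoded );
my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,
    # 'text' is the raw source, 'dtext' the entity-decoded version
    text_h => [ sub { ( $raw, $decoded ) = @_ }, 'text,dtext' ],
);
$p->parse('Tom &amp; Jerry');
$p->eof;

print "$raw\n";        # Tom &amp; Jerry
print "$decoded\n";    # Tom & Jerry
```

So switching the handler from dtext to text changes entity handling, but capturing nested markup requires also handling the start and end events inside the link.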

        Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?

        Probably. I'm curious, how many HTML pages will you be parsing per second in your finished product?

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }