isync has asked for the wisdom of the Perl Monks concerning the following question:

As in this post, I use HTML::Parser to extract the content of <a href> tags.
$p->handler( start => \&a_start_handler, "tagname,self,attr" );
$p->unbroken_text( 1 );
$p->parse( $content ) || die $!;

foreach my $link ( @linklist ) {
    print $link->[0];    # link
    print $link->[1];    # text
}

sub a_start_handler {
    my( $tag, $self, $attr ) = @_;

    # we only act on <a> tags
    return if $tag ne "a";

    if( defined( $href = $attr->{href} ) ) {
        $self->handler( text => sub { $text = shift; $text =~ s/\n//g; }, "dtext" );
        $self->handler( end  => \&a_end_handler, "tagname,self" );
    }

    foreach my $key ( keys %$attr ) {
        # print ">$key=$attr->{$key}\n";
    }
}

sub a_end_handler {
    return if shift ne "a";
    my $self = shift;

    push @linklist, [ $href, $text ] if defined $text && $text !~ /^\s*$/;

    $self->handler( end  => undef );
    $self->handler( text => undef );
}

And, as it should, it properly strips all HTML from it. But now I also need to get the markup contained in the link. However, changing this line
$self->handler( text => sub { $text = shift; $text =~ s/\n//g; }, "dtext" );
to
$self->handler( text => sub { $text = shift; $text =~ s/\n//g; }, "text" );
does not give the expected result (switching from getting dtext to getting text).
Any hints? Am I on the wrong track here (admitting that the parser interface is a bit hard to understand)?
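One way to capture the markup inside the link with plain HTML::Parser is to stop relying on the text handler altogether and instead accumulate the raw source of every event between <a> and </a>, using the "text" argspec (which hands a handler the source text of the event, tags included). This is a minimal sketch under that assumption, restructured from the code above; nested <a> elements are not handled:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @linklist;
my ( $href, $inner );

sub a_start_handler {
    my ( $tag, $self, $attr ) = @_;
    return if $tag ne 'a';
    return unless defined $attr->{href};

    $href  = $attr->{href};
    $inner = '';

    # Inside the link: append the raw source ("text" argspec) of every
    # event, so nested markup such as <b>...</b> survives verbatim.
    $self->handler( start => sub { $inner .= $_[0] }, 'text' );
    $self->handler( text  => sub { $inner .= $_[0] }, 'text' );
    $self->handler( end   => \&a_end_handler, 'tagname,self,text' );
}

sub a_end_handler {
    my ( $tag, $self, $raw ) = @_;
    if ( $tag ne 'a' ) {    # a nested close tag like </b>
        $inner .= $raw;
        return;
    }

    $inner =~ s/\n//g;
    push @linklist, [ $href, $inner ] if $inner !~ /^\s*$/;

    # restore the outer handlers for the next link
    $self->handler( start => \&a_start_handler, 'tagname,self,attr' );
    $self->handler( text  => undef );
    $self->handler( end   => undef );
}

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&a_start_handler, 'tagname,self,attr' );
$p->unbroken_text( 1 );
$p->parse('<p><a href="x.html">click <b>here</b></a></p>');
$p->eof;

print "$_->[0] => $_->[1]\n" for @linklist;
# prints: x.html => click <b>here</b>
```

The point of the swap is that start and end tags are separate parser events that never reach the text handler, so no argspec on the text handler alone can deliver them.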

Replies are listed 'Best First'.
Re: HTML::Parser to extract link text?
by Juerd (Abbot) on Jun 19, 2007 at 18:13 UTC

    I wouldn't use HTML::Parser if I wanted just parts of the document. H::P is nice if you want to iteratively go through an entire page, but maintaining state quickly becomes boring and error-prone.

    HTML::TreeBuilder, which is based on HTML::Parser, is easier to use.

    use strict;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file("test.html");

    my $content_as_html = sub {
        join "", map { ref($_) ? $_->as_HTML : $_ } shift->content_list;
    };

    for my $element ( $tree->look_down( _tag => "a", href => qr/./ ) ) {
        my $content = $element->$content_as_html;
        my $href    = $element->attr("href");
        $content =~ s/\n//g;
        print ">> $href, $content\n";
    }

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Thank you for your opinion.

      I made quite an effort to benchmark various versions against each other, and now that I am done I wouldn't like to go back and do it again.

      My results were that HTML::Parser is the fastest solution, beating HTML::LinkExtor and HTML::LinkExtractor. My guess is that it is so fast because both link-specific modules are built on top of HTML::Parser, so using the underlying library directly is faster still.
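A rough harness for this kind of comparison can be built on the core Benchmark module; the sketch below is illustrative (the document size and iteration count are made up for the example), counting <a href> tags with both HTML::Parser and HTML::LinkExtor:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::Parser;
use HTML::LinkExtor;

# a synthetic document with a known number of links
my $html = '<p>' . ( '<a href="x.html">link</a> text ' x 200 ) . '</p>';

sub count_with_parser {
    my ($doc) = @_;
    my $n = 0;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $attr ) = @_;
                $n++ if $tag eq 'a' && defined $attr->{href};
            },
            'tagname,attr'
        ],
    );
    $p->parse($doc);
    $p->eof;
    return $n;
}

sub count_with_linkextor {
    my ($doc) = @_;
    my $n = 0;
    my $p = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        $n++ if $tag eq 'a' && defined $attr{href};
    } );
    $p->parse($doc);
    $p->eof;
    return $n;
}

# compare 100 runs of each; cmpthese prints a rate table
cmpthese( 100, {
    'HTML::Parser'    => sub { count_with_parser($html) },
    'HTML::LinkExtor' => sub { count_with_linkextor($html) },
} );
```

Whatever the numbers come out to, it is worth checking that both variants extract the same set of links before trusting the timing comparison.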

      Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?

      So, back to the original question: any comments on how to get HTML::Parser to deliver text instead of dtext?
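For what it's worth, the difference between the text and dtext argspecs is only entity decoding of the text event itself; neither one ever includes tags, because start and end tags are separate events that never reach the text handler. A minimal sketch (hypothetical snippet, not from the thread) showing both argspecs on the same event:

```perl
use strict;
use warnings;
use HTML::Parser;

my ( $raw, $decoded );
my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,
    # 'text' is the raw source, 'dtext' the entity-decoded version
    text_h => [ sub { ( $raw, $decoded ) = @_ }, 'text,dtext' ],
);
$p->parse('Tom &amp; Jerry');
$p->eof;

print "$raw\n";        # Tom &amp; Jerry
print "$decoded\n";    # Tom & Jerry
```

So switching the handler from dtext to text changes entity handling, but capturing nested markup requires also handling the start and end events inside the link.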

        Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?

        Probably. I'm curious, how many HTML pages will you be parsing per second in your finished product?

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }