in reply to HTML::Parser to extract link text?

I wouldn't use HTML::Parser if I wanted just parts of the document. H::P is nice if you want to go through an entire page event by event, but maintaining state yourself quickly becomes boring and error-prone.

HTML::TreeBuilder, which is based on HTML::Parser, is easier to use.

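use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("test.html") or die "Cannot parse test.html: $!";

# Called as a method on an element: joins its children back into HTML.
# Child elements are serialized with as_HTML; plain text nodes are kept as-is.
my $content_as_html = sub {
    join "", map { ref($_) ? $_->as_HTML : $_ } shift->content_list;
};

# Every <a> element that has a non-empty href attribute
for my $element ($tree->look_down(_tag => "a", href => qr/./)) {
    my $content = $element->$content_as_html;
    my $href    = $element->attr("href");
    $content =~ s/\n//g;
    print ">> $href, $content\n";
}

$tree->delete;  # free the tree explicitly when done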

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^2: HTML::Parser to extract link text?
by isync (Hermit) on Jun 19, 2007 at 21:15 UTC
    Thank you for your opinion.

    I made quite an effort to benchmark various versions against each other, and now that I am done I wouldn't like to go back and do it again.

    My results were that HTML::Parser is the fastest solution, beating LinkExtor and LinkExtractor. My guess is that this is because both link-specific modules are built on Parser, so using the underlying library directly is faster still.

    Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?

    So, back to the original question: any comments on how I get Parser to give me text instead of dtext?

      Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?

      Probably. I'm curious, how many HTML pages will you be parsing per second in your finished product?

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        This question does not exactly apply. ;-) But for the link-extraction part, Parser took around 0.08 secs where Extor took 0.12 secs and Extractor 0.28 secs (for average HTML).

        Any help with my Parser question?
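
On the text-versus-dtext question, here is a minimal sketch with plain HTML::Parser (version 3 API). The argspec string passed with each handler selects what the handler receives: "dtext" is the entity-decoded text, while "text" is the original source text. The sketch asks for "text" on all three handlers and keeps a small piece of state (the link currently being collected). The name extract_links and the hash layout are made up for illustration, and it assumes anchors are not nested.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a list of { href => ..., content => ... }
# hashes, one per <a href="..."> element, with the inner markup kept as
# it appears in the source.
sub extract_links {
    my ($html) = @_;
    my (@links, $current);    # $current is the link being collected, if any

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr, $text) = @_;
            if ($tagname eq "a" and defined $attr->{href}) {
                $current = { href => $attr->{href}, content => "" };
            }
            elsif ($current) {
                $current->{content} .= $text;    # nested start tag, verbatim
            }
        }, "tagname, attr, text" ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            return unless $current;
            if ($tagname eq "a") {
                push @links, $current;           # link complete
                undef $current;
            }
            else {
                $current->{content} .= $text;    # nested end tag, verbatim
            }
        }, "tagname, text" ],
        # "text" passes the source text unchanged; "dtext" would decode entities
        text_h => [ sub {
            $current->{content} .= shift if $current;
        }, "text" ],
    );

    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Entities and nested tags inside the link stay exactly as written.
for my $link (extract_links('<a href="/x">Tom &amp; <b>Jerry</b></a>')) {
    print ">> $link->{href}, $link->{content}\n";
}
```

Text may arrive in several chunks, which is why the handlers append to the content instead of assigning it. This is also the state bookkeeping that TreeBuilder does for you.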