Re: HTML::Parser to extract link text?

I wouldn't use HTML::Parser if I wanted just parts of the document. H::P is nice if you want to go through an entire page event by event, but maintaining state yourself quickly becomes boring and error-prone.
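
To make the state-keeping concrete, here is a minimal sketch of what collecting link text with plain HTML::Parser tends to look like. The sample HTML, the @links array and the $current variable are made up for illustration; only the handler names and the "tagname", "attr", "text" and "dtext" argspecs come from HTML::Parser's version 3 API.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample input, for illustration only.
my $html = '<p>See <a href="/a?x=1&amp;y=2">Tom &amp; <b>Jerry</b></a>'
         . ' and <a name="n">no href</a>.</p>';

my @links;      # collected [href, text] pairs
my $current;    # the link being read, or undef while outside <a href=...>

my $parser = HTML::Parser->new(
    api_version => 3,

    # Remember that we entered a link that has an href.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'a' && defined $attr->{href};
        $current = { href => $attr->{href}, text => '' };
    }, 'tagname, attr' ],

    # "dtext" passes text with entities decoded; using "text" here
    # instead would pass the raw source text, entities untouched.
    text_h => [ sub {
        my ($dtext) = @_;
        $current->{text} .= $dtext if $current;
    }, 'dtext' ],

    # Leaving the link: store what was gathered and reset the state.
    end_h => [ sub {
        my ($tag) = @_;
        return unless $tag eq 'a' && $current;
        push @links, [ $current->{href}, $current->{text} ];
        undef $current;
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print ">> $_->[0], $_->[1]\n" for @links;
```

Even in this small case the three handlers have to agree on one shared variable, and markup nested inside the link (the <b> above) is dropped unless you add yet more state to rebuild it.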

HTML::TreeBuilder, which is based on HTML::Parser, is easier to use.

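use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("test.html")
    or die "Cannot parse test.html: $!";

# Returns the inner HTML of an element: child elements are
# serialised with as_HTML, plain text nodes are used as they are.
my $content_as_html = sub {
    join "", map { ref($_) ? $_->as_HTML : $_ } shift->content_list;
};

# Every <a> element that has a non-empty href attribute.
for my $element ($tree->look_down(_tag => "a", href => qr/./)) {
    my $content = $element->$content_as_html;
    my $href    = $element->attr("href");

    $content =~ s/\n//g;    # keep each link on one output line
    print ">> $href, $content\n";
}

$tree->delete;    # free the tree explicitly when done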

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }


Re^2: HTML::Parser to extract link text? by isync (Hermit) on Jun 19, 2007 at 21:15 UTC
Thank you for your opinion. I made quite an effort to benchmark various versions against each other, and now that I am done I would rather not go back and do it again. My results were that using HTML::Parser is the fastest solution, beating LinkExtor and LinkExtractor. My guess is that it is so fast because both link-specific modules are based on Parser, so using the underlying library directly is faster still. Applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right? So, back to the original question: any comments on how I get Parser to do text instead of dtext?
Re^3: HTML::Parser to extract link text? by Juerd (Abbot) on Jun 19, 2007 at 21:23 UTC
"Now, applying this knowledge, my feeling is that using TreeBuilder would again hurt performance. Right?"

Probably. I'm curious, how many HTML pages will you be parsing per second in your finished product?

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
Re^4: HTML::Parser to extract link text? by isync (Hermit) on Jun 19, 2007 at 21:32 UTC
This question does not exactly apply. ;-) But for the link-extraction part, Parser took around 0.08 secs, where Extor took 0.12 secs and Extractor 0.28 secs (for average HTML). Any help with my Parser question?
Re^5: HTML::Parser to extract link text? by Juerd (Abbot) on Jun 19, 2007 at 21:50 UTC
Re^6: HTML::Parser to extract link text? by Anonymous Monk on Jun 20, 2007 at 08:40 UTC