I've been doing some web automation and I'm using HTML::TreeBuilder everywhere.
I was also afraid of overkill, but when you don't need the power, you don't have to use it, and it has made a few things really easy compared to what I could do with other tools.
Btw, I think your code could be improved in this way:
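use HTML::TreeBuilder;
use strict; # examples aren't exempt!!!
use warnings;

my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof();    # tell the parser the input is complete
my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
    my $href  = $link->attr('href');    # undef if the anchor has no href
    my $descr = $link->as_text();       # text only, nested markup stripped
}
$parser->delete();    # free the tree when done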
This drops the assumption that the anchor holds only simple text: as_text() returns just the text inside the anchor element, however it is nested. Your code would have picked up markup elements embedded in the anchor, like:
<a href="..."><p class="big-and-bold">Winners!</p> for today</a>
Fetching $link->content->[0] on the above would get you an HTML::Element object rather than a string.
I know you pointed out this limitation, but I think the original Seeker might like to have the as_text() method pointed out, since extracting text from HTML appears to be the thing of interest.
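For what it's worth, here is a minimal sketch of the difference between the first child and as_text(). The helper name link_parts and the sample markup are made up for illustration, and I've used an inline <b> instead of the <p> above so the result doesn't depend on how the parser treats a block element inside an anchor.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns one hashref per anchor found in $html.
sub link_parts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof();    # finish parsing before walking the tree

    my @parts;
    foreach my $link ($tree->look_down('_tag' => 'a')) {
        # First child is either a plain string or an HTML::Element.
        my ($first) = $link->content_list;
        push @parts, {
            href        => $link->attr('href'),
            text        => $link->as_text(),
            first_is_el => (ref $first ? 1 : 0),
        };
    }
    $tree->delete();
    return @parts;
}

# Sample input, invented for this example.
for my $p (link_parts('<a href="/today"><b>Winners!</b> for today</a>')) {
    print "$p->{href}: $p->{text} (first child is an element: $p->{first_is_el})\n";
}
```

If the first child is an element, calling as_text() on the anchor saves you from walking the content yourself.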