Re: Using HTTP::LinkExtor to get URL and description info

HTML::TreeBuilder might be overkill for what you need, but it's simple:

use HTML::TreeBuilder;
use strict;  # examples aren't exempt!!!

my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof();  # tell the parser the input is complete

my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
   my $href = $link->attr('href');
   my $descr = $link->content->[0];  # Assumes only simple text contents
}
$parser->delete();

Replies are listed 'Best First'.
Re: Re: Using HTTP::LinkExtor to get URL and description info
by jordanh (Chaplain) on Aug 10, 2002 at 16:05 UTC

I've been doing some web automation and I'm using HTML::TreeBuilder everywhere. I was also afraid of overkill, but when you don't need the power, you don't have to use it, and it has made a few things really easy compared to what I could do with other tools.

Btw, I think your code could be improved in this way:

use HTML::TreeBuilder;
use strict;  # examples aren't exempt!!!

my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof();  # tell the parser the input is complete

my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
   my $href = $link->attr('href');
   my $descr = $link->as_text();
}
$parser->delete();

This removes the assumption about only simple text contents and gets only the text from the anchor element. Your code would have picked up markup elements embedded in the anchor element, like:

<a href="..."><p class="big-and-bold">Winners!</p> for today</a>

Fetching $link->content->[0] on the above would get you an HTML::Element rather than a string.

I know you pointed out this limitation, but I think the original Seeker might like to have the as_text() method pointed out, as extracting text from HTML appears to be the thing of interest.
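
To make the difference concrete, here is a minimal sketch, assuming HTML::TreeBuilder is installed from CPAN. The extract_links helper, the sample markup, and the example.com URL are made up for illustration and are not from the posts above; a <b> element stands in for the embedded markup.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns one hashref per
# anchor with its href, its flattened text, and a flag saying whether
# the anchor's first child is an element rather than a plain string.
sub extract_links {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # signal end of input

    my @found;
    for my $link ( $tree->look_down( _tag => 'a' ) ) {
        my $first = ( $link->content_list )[0];
        push @found, {
            href             => $link->attr('href'),
            text             => $link->as_text,
            first_is_element => ref($first) ? 1 : 0,
        };
    }

    $tree->delete;    # release the tree explicitly
    return @found;
}

# Sample markup; the URL is a placeholder.
my $sample = '<a href="http://example.com/winners"><b>Winners!</b> for today</a>';

for my $info ( extract_links($sample) ) {
    print "href: $info->{href}\n";
    print "text: $info->{text}\n";
    print "first child is markup: ",
        ( $info->{first_is_element} ? "yes" : "no" ), "\n";
}
```

When first_is_element is true, content->[0] would hand back an object instead of text, while as_text still returns the anchor's text with the markup stripped.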
