in reply to Using HTTP::LinkExtor to get URL and description info

HTML::TreeBuilder might be overkill for what you need, but it's simple:
    use HTML::TreeBuilder;
    use strict;    # examples aren't exempt!!!

    my $parser = HTML::TreeBuilder->new;
    $parser->parse($html_code_from_elsewhere);
    $parser->eof;    # tell the parser the input is complete

    my @links = $parser->look_down('_tag' => 'a');
    foreach my $link (@links) {
        my $href  = $link->attr('href');
        my $descr = $link->content->[0];    # Assumes only simple text contents
    }
    $parser->delete();

Re: Re: Using HTTP::LinkExtor to get URL and description info
by jordanh (Chaplain) on Aug 10, 2002 at 16:05 UTC
    I've been doing some web automation and I'm using HTML::TreeBuilder everywhere.

    I was also afraid of overkill, but when you don't need the power, you don't have to use it, and it has made a few things really easy compared to what I could do with other tools.

    Btw, I think your code could be improved in this way:

        use HTML::TreeBuilder;
        use strict;    # examples aren't exempt!!!

        my $parser = HTML::TreeBuilder->new;
        $parser->parse($html_code_from_elsewhere);
        $parser->eof;    # tell the parser the input is complete

        my @links = $parser->look_down('_tag' => 'a');
        foreach my $link (@links) {
            my $href  = $link->attr('href');
            my $descr = $link->as_text();
        }
        $parser->delete();

    This removes the assumption about only simple text contents and only gets text from the anchor element. Your code would have gotten markup elements embedded in the anchor element, like:

    <a href="..."><p class="big-and-bold">Winners!</p> for today</a>

    Fetching $link->content->[0] on the above would get you an HTML::Element object rather than a string.

    I know you pointed out this limitation, but I think the original Seeker might like to have the as_text() method pointed out, since extracting text from HTML appears to be the thing of interest.
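
To illustrate the difference jordanh describes, here is a minimal sketch. It assumes HTML::TreeBuilder is installed from CPAN. The helper name `link_pairs` and the sample markup are made up for this example, and the sample uses `<b>` instead of `<p>` inside the anchor to keep the parsed tree simple.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns one [href, text, first_child] triple per <a> element.
sub link_pairs {
    my ($html) = @_;

    # new_from_content parses the string and finalizes the tree in one call.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @pairs;
    for my $link ( $tree->look_down( _tag => 'a' ) ) {
        # The first child may be a plain string or an HTML::Element object.
        my $first = ( $link->content_list )[0];

        push @pairs, [
            $link->attr('href'),
            $link->as_text,    # all text under the anchor, markup stripped
            ref $first ? ref $first : 'plain string',
        ];
    }

    $tree->delete;    # free the tree explicitly
    return @pairs;
}

# Made-up sample input with markup nested inside the anchor.
my $sample = '<a href="/winners"><b class="big-and-bold">Winners!</b> for today</a>';

for my $pair ( link_pairs($sample) ) {
    my ( $href, $text, $first_kind ) = @$pair;
    print "href:        $href\n";
    print "as_text:     $text\n";
    print "first child: $first_kind\n";
}
```

The third field shows why taking the first content item is fragile: when the anchor starts with an element, that item is an object, while as_text() returns a string either way.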