Retrieving Links from a HTML Page

gnikol1 has asked for the wisdom of the Perl Monks concerning the following question:

Below I am trying to get the links from one page, but I cannot: it returns blank. The problem is not in authentication etc. Can you help me?

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
my @imgs = ();
$url = "http://www.sn.no/";  # for instance
$ua = new LWP::UserAgent;
$ua->proxy(['http', 'ftp'] => 'http://proxy');

# Make the parser.  Unfortunately, we don't know the base yet (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives

$res = HTTP::Request->new(GET => $url);
$res->proxy_authorization_basic("user", "pass");
$res= $ua->request($res),sub {$p->parse($_[0])};

# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;

# Print them out
print join("\n", @imgs), "\n";

# Set up a callback that collect image links

sub callback {
   my($tag, %attr) = @_;
   return if $tag ne 'a href ';  # we only look closer at <img ...>
   push(@imgs, values %attr);
}

Added code tags 2002-02-21 dvergin


Replies are listed 'Best First'.
Re: Retrieving Links from a HTML Page by boo_radley (Parson) on Feb 21, 2002 at 17:55 UTC
sub callback {
   my($tag, %attr) = @_;
   return if $tag ne 'a href ';  # we only look closer at <img ...>
   push(@imgs, values %attr);
}

This makes no sense whatsoever. You're including an attribute in the tag name and tacking on a trailing space, and your comment says you want to look at images, which doesn't jibe with the comparison you're making. In fact, on closer review, this seems to be one of the examples from HTML::LinkExtor's POD. You might have more luck adapting the code from the synopsis, which prints out the links.
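For what it's worth, here is a minimal sketch of what the callback might look like if the goal is the href of each <a> tag. HTML::LinkExtor passes the lowercased tag name as the first argument, so the comparison is against 'a' alone. The sample HTML, the base URL and the name @links are assumptions made up for illustration, not anything from the original post.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical sample page and base URL, used only for illustration.
my $base = 'http://www.example.com/dir/';
my $html = <<'HTML';
<html><body>
<a href="page1.html">One</a>
<img src="logo.png">
<a href="http://www.example.org/">Two</a>
</body></html>
HTML

my @links;

# LinkExtor calls this once per link-bearing tag, passing the lowercased
# tag name followed by attribute => value pairs.
sub callback {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a';    # tag name only, no attribute, no spaces
    push @links, $attr{href} if defined $attr{href};
}

my $p = HTML::LinkExtor->new(\&callback);
$p->parse($html);
$p->eof;

# Resolve relative links against the base URL.
my @abs = map { URI->new_abs($_, $base)->as_string } @links;
print "$_\n" for @abs;
```

The <img> tag is skipped because its tag name is 'img'. Swapping the test to 'img' and the attribute to 'src' would collect images instead.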
Re: Retrieving Links from a HTML Page by gav^ (Curate) on Feb 21, 2002 at 17:49 UTC
Firstly, the page at http://www.sn.no/ doesn't contain any links (it is a frameset). Secondly, you might want to check the response to see if everything went OK:

unless ($res->is_success) {
    # handle error here!
}

gav^
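A rough sketch of how the request and the success check might fit together. One thing worth noting in the posted code: in `$res= $ua->request($res),sub {...};` the sub sits outside the parentheses of request(), so it is never handed to LWP and the parser never sees any content. To stay runnable without a network or proxy, this sketch fetches a temporary local file through a file: URL; that file, its contents and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI;
use URI::file;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the remote page: a local temp file.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} qq{<html><body><a href="about.html">About</a></body></html>\n};
close $fh or die "close failed: $!";
my $url = URI::file->new_abs($path);

my @links;
my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
});

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $url);

# The content callback goes inside the parentheses, as the second argument.
my $res = $ua->request($req, sub { $p->parse($_[0]) });
$p->eof;

if ($res->is_success) {
    # Only now is the base known; resolve relative links against it.
    my $base = $res->base;
    print URI->new_abs($_, $base), "\n" for @links;
}
else {
    warn "Request failed: ", $res->status_line, "\n";
}
```

With a real http: URL the proxy and proxy_authorization_basic lines from the question would go back in before the request() call.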
Re: Re: Retrieving Links from a HTML Page by gellyfish (Monsignor) on Feb 22, 2002 at 11:06 UTC
If the target page is a frameset then gnikol1 might want to see Re: Browser Emulation for a way to read the contents of the individual frames.

/J\
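This is not the code from the node referred to above, just a hedged sketch of one possible first step: HTML::LinkExtor also reports <frame> tags with their src attribute, so the frame URLs can be collected and then fetched one by one. The frameset markup and base URL below are invented for illustration; the real page may look different.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical frameset markup and base URL, for illustration only.
my $base = 'http://www.example.com/';
my $html = <<'HTML';
<html><frameset cols="20%,80%">
<frame src="menu.html" name="menu">
<frame src="/main/index.html" name="main">
</frameset></html>
HTML

my @frames;
my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    # Keep only the src of each <frame>.
    push @frames, $attr{src} if $tag eq 'frame' && defined $attr{src};
});
$p->parse($html);
$p->eof;

# Absolute URLs of the individual frames; each one could then be
# requested and parsed for links in turn.
my @frame_urls = map { URI->new_abs($_, $base)->as_string } @frames;
print "$_\n" for @frame_urls;
```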