Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've found a script that should parse all of the links from a web page. It seems pretty cool, but I can't get it working. I think I'm having problems interfacing with our proxy. Here is the code, as found in "INSTANT PERL MODULES":
use strict;
use LWP::UserAgent;
use URI::URL;

my $url = URI::URL->new('http://cnn.com');
my $base_url;

# Creating new UserAgent browser
my $ua = LWP::UserAgent->new();

# Agent named
$ua->agent("Netscape");

# Create HTTP request
my $request = HTTP::Request->new(GET => $url);

# Execute request
my $response = $ua->request($request);

# Check for success
if ($response->is_success && $response->content_type eq 'text/html') {
    # request successful & html
    $base_url = $response->base();
    print "Base URL: $base_url\n";
    my $link_extor = HTML::LinkExtor->new(\&extract_links);
    $link_extor->parse($response->content);
}
else {
    # request failed
    print "Error getting document: ", $response->status_line, "\n";
}

sub extract_links {
    my ($tag, %attr) = @_;
    if ($tag eq 'a' or $tag eq 'img') {
        foreach my $key (keys %attr) {
            if ($key eq 'href' or $key eq 'src') {
                my $link_url = URI->new($attr{$key});
                my $full_url = $link_url->abs($base_url);
                print "LINK: $full_url\n";
            }
        }
    }
}

When running this script, I receive the error "Error getting document: 500 Can't connect to cnn.com:80"

Re: LWP::NOT so simple
by jasonk (Parson) on Mar 17, 2003 at 18:20 UTC

    How about an even simpler version, using my new favorite module WWW::Mechanize?

    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;

    my $agent = WWW::Mechanize->new();
    $agent->get('http://cnn.com/');

    foreach (@{$agent->links()}) {
        print "LINK:".URI->new($_->[0])->abs($agent->base())."\n";
    }

    We're not surrounded, we're in a target-rich environment!
Re: LWP::NOT so simple
by BrowserUk (Patriarch) on Mar 17, 2003 at 18:02 UTC

    Apart from the fact that you have omitted use HTML::LinkExtor; from your script as posted, it seems to work fine from my location.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: LWP::NOT so simple
by webfiend (Vicar) on Mar 17, 2003 at 18:05 UTC

    Does the script work at any other sites? It works dandy for me, other than the fact that "use HTML::LinkExtor" needs to be in there somewhere.

    Do you happen to know if LWP was configured for your proxy when it was installed?


    I just realized that I was using the same sig for nearly three years.

      I'm sure it wasn't...I didn't do anything special for it. How do I configure it?

        I think bart's suggestion below, at Re: LWP::NOT so simple, is the way to go :-)



Re: LWP::NOT so simple
by bart (Canon) on Mar 17, 2003 at 19:26 UTC
    Set your environment variable HTTP_PROXY to the URL of your proxy, in a format like "http://proxy.pandora.be:8080" (that's the proxy my provider makes me use). Next, after you have created $ua and before you try to retrieve the page over HTTP, call
    $ua->env_proxy;
    That should do the trick, unless you also need a username and password for the proxy?
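
For anyone who wants to see the proxy setup in one place, here is a minimal sketch of the suggestion above. It only configures the user agent and builds a request, so it runs without touching the network. The helper name make_proxied_agent, the host proxy.example.com, and the credentials are placeholders invented for illustration; the Proxy-Authorization header is only needed if your proxy actually asks for a login, and basic authentication is an assumption about how your proxy works.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: returns a user agent configured for a proxy.
# With an explicit URL it calls proxy(); without one it falls back to
# env_proxy, which reads variables such as HTTP_PROXY from the environment.
sub make_proxied_agent {
    my ($proxy_url) = @_;
    my $ua = LWP::UserAgent->new();
    if (defined $proxy_url) {
        $ua->proxy('http', $proxy_url);
    }
    else {
        $ua->env_proxy;
    }
    return $ua;
}

# Placeholder proxy address; substitute the one your network uses.
my $ua = make_proxied_agent('http://proxy.example.com:8080');
print "HTTP proxy in use: ", $ua->proxy('http'), "\n";

# Build the request as in the original script.
my $request = HTTP::Request->new(GET => 'http://cnn.com/');

# Only if the proxy demands a login (placeholder credentials,
# basic authentication assumed).
$request->proxy_authorization_basic('proxy_user', 'proxy_password');

# Inspect what would be sent; $ua->request($request) performs the fetch.
print $request->as_string;
```

Calling make_proxied_agent() with no argument gives the env_proxy behaviour described above, so the same script can be moved between machines by changing only the environment.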