George_Sherston has asked for the wisdom of the Perl Monks concerning the following question:

Sibling monks, give ear unto my tale I pray.

I'm working on a script that will crawl a number of newspaper sites and pull out articles that match some search criteria. So at the moment I have a file, "papers.txt", of URLs chosen because they are the main link pages for the articles I'm interested in, and I'm going through them like this:
use LWP::Simple;
open PAPERS, "papers.txt";
while (<PAPERS>) {
    $page = $_;
    $html = get $page or next;
    while ($html =~ s/<a href="(.*?)".*?>(.*?)<//is) {
        my $url = $1;
        my $headline = $2;
        if ($headline =~ /$regexp/i) {    # defined elsewhere
            my $article;    # get the article - but HOW?
            SaveArticle($article,$headline,$url);
        }
    }
}
Now, this works as far as it goes. But obviously it doesn't go all the way. What I want is to take $url and feed it again to LWP::Simple to pull out the text of the article which I can then parse and save. And I can't rely on using $url because it's almost certainly a relative URL.

My first try was:
while (<PAPERS>) {
    $page = $_;
    $html = get $page or next;
    my $path = $_;
    $path =~ s#^(.*)/.*?$#$1#;
    while ($html =~ s/<a href="(.*?)".*?>(.*?)<//is) {
        my $url = $1;
        my $headline = $2;
        if ($headline =~ /$regexp/i) {
            $url = $path . $url;
            my $article = get $url || 'goom';
            SaveArticle($article,$headline,$url);
        }
    }
}
But quite often this fails for one reason or another... perhaps because the page I'm on is on a different path from the page I want. What I'd really like is to get Perl to do exactly what I'd do, that is, click on the link, and let the page itself take care of the question whether it's a relative link and what it's relative *to*.

Is there a neat way to do this?

I had a look at URI::URL::full_path, but as far as I can see this would save me some time *if* I already knew what I needed to create the absolute URL, but I don't.

It may be that I am betraying some ignorance of how URLs work, and if so I apologise for asking a dim question and solicit enlightenment.

§ George Sherston

Replies are listed 'Best First'.
Re: Handling relative urls with LWP or something else
by merlyn (Sage) on Dec 19, 2001 at 20:55 UTC
(ichimunki) Re: Handling relative urls with LWP or something else
by ichimunki (Priest) on Dec 19, 2001 at 21:19 UTC
    While you might want to keep merlyn's columns in mind as good reading and a source of (hopefully) good examples using common CPAN modules, you might also want to check the POD for URI. And I quote:

    $uri = URI->new_abs( $str, $base_uri )
    This constructs a new absolute URI object. The $str argument can denote a relative or absolute URI. If relative, then it will be absolutized using $base_uri as base. The $base_uri must be an absolute URI.

    So if you know the URL of the page you got the URL from, you know $base_uri. A simple test like $url =~ /^http/ should let you know whether you've gotten an absolute URL or not. Of course, testing is not necessary as the constructor will gladly ignore the $base_uri if $str is absolute. Or so sayeth the POD anyway.
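    A minimal sketch of that idea, assuming the URI module from CPAN is installed. The absolutize helper and the example.com URLs are made up for illustration and are not from the original script; in the crawler loop, $page (the URL of the page the link was found on) would play the role of the base.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of the original script): resolve a link
# against the URL of the page it was found on. URI->new_abs returns the
# link unchanged if it is already absolute.
sub absolutize {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

# Made-up base URL standing in for one line of papers.txt.
my $page = 'http://www.example.com/news/today/index.html';

# A mix of relative, root-relative and absolute links.
for my $link (
    'story1.html',
    '../archive/story2.html',
    '/sport/story3.html',
    'http://other.example.org/story4.html',
) {
    print absolutize($link, $page), "\n";
}
```

    The string that comes back can then be handed to get from LWP::Simple in place of the raw $url, with no /^http/ test needed beforehand.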
Re: Handling relative urls with LWP or something else
by tachyon (Chancellor) on Dec 19, 2001 at 21:11 UTC

    If you have a look here you can check out the code of a link checker I wrote. Naturally it needs to crawl an entire site and correctly handle relative links. As you will see it takes less than half a dozen lines to do this.

    Update

    Actually looking at that code there is a subtle bug in it. Here is a complete sub that I will use to modify that code (one day :-)
    print rel2abs( 'http://foo.com/aa/bb/cc', './../.././bar.htm' );
    sub rel2abs {
        my ($root, $link) = @_;
        $link =~ s|^\s*/||;                    # trim link
        $root .= '/' unless $root =~ m|/$|;    # ensure trailing /
        # move back up tree in response to ../ and ignore ./
        while ( $link =~ s|^(\.?\./)|| ) {
            next if $1 eq './';    # effectively just delete ./
            # trim dirs from root only until we can't go any further!
            $root =~ s|[^/]+/$|| unless $root =~ m|http://[^/]+/$|i;
        }
        return $root.$link;
    }

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print