George_Sherston has asked for the wisdom of the Perl Monks concerning the following question:
    use LWP::Simple;
    open PAPERS, "papers.txt";
    while (<PAPERS>) {
        $page = $_;
        $html = get $page or next;
        while ($html =~ s/<a href="(.*?)".*?>(.*?)<//is) {
            my $url = $1;
            my $headline = $2;
            if ($headline =~ /$regexp/i) { # defined elsewhere
                my $article; # get the article - but HOW?
                SaveArticle($article,$headline,$url);
            }
        }
    }

Now, this works as far as it goes. But obviously it doesn't go all the way. What I want is to take $url and feed it again to LWP::Simple to pull out the text of the article, which I can then parse and save. And I can't rely on using $url because it's almost certainly a relative URL.
    while (<PAPERS>) {
        $page = $_;
        $html = get $page or next;
        my $path = $_;
        $path =~ s#^(.*)/.*?$#$1#;
        while ($html =~ s/<a href="(.*?)".*?>(.*?)<//is) {
            my $url = $1;
            my $headline = $2;
            if ($headline =~ /$regexp/i) {
                $url = $path . $url;
                my $article = get $url || 'goom';
                SaveArticle($article,$headline,$url);
            }
        }
    }

But quite often this fails for one reason or another... perhaps because the page I'm on is on a different path from the page I want. What I'd really like is to get Perl to do exactly what I'd do, that is, click on the link, and let the page itself take care of the question whether it's a relative link and what it's relative *to*.
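For illustration, here is a minimal sketch of one way to get the "relative to what?" behaviour described above, assuming the URI module (which is installed alongside LWP) is available. It resolves a link against the URL of the page it was found on, much as a browser would. The helper name `absolute_url` and the example.com addresses are hypothetical, not part of the original code.

```perl
use strict;
use warnings;
use URI;

# Resolve a link found on a page against that page's own URL.
# Absolute links pass through unchanged; relative ones are
# resolved the way a browser would resolve them.
sub absolute_url {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

# Hypothetical page URL, standing in for one line of papers.txt
my $page = 'http://www.example.com/news/index.html';

for my $link ('story1.html',
              '/sport/story2.html',
              '../archive/story3.html',
              'http://other.example.org/a.html') {
    print absolute_url($link, $page), "\n";
}
```

In the loop above this would presumably replace the `$path . $url` concatenation, with `$page` chomped first so the trailing newline does not end up in the base URL. Note that this sketch takes the requested URL as the base; if the server redirects or the page declares its own base, the true base can differ, and the `base` method of the HTTP::Response object returned by LWP::UserAgent is one way to find it.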
Replies are listed 'Best First'.

Re: Handling relative urls with LWP or something else
by merlyn (Sage) on Dec 19, 2001 at 20:55 UTC

(ichimunki) Re: Handling relative urls with LWP or something else
by ichimunki (Priest) on Dec 19, 2001 at 21:19 UTC

Re: Handling relative urls with LWP or something else
by tachyon (Chancellor) on Dec 19, 2001 at 21:11 UTC