Sibling monks, give ear unto my tale I pray.

I'm working on a script that will crawl a number of newspaper sites and pull out articles that match some search criteria. So at the moment I have a file, "papers.txt", of URLs chosen because they are the main link pages for the articles I'm interested in, and I'm going through them like this:

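use LWP::Simple;
open my $papers, '<', 'papers.txt' or die "Can't open papers.txt: $!";
while (my $page = <$papers>) {
    chomp $page;                     # drop the trailing newline from the URL
    my $html = get $page or next;    # skip pages that can't be fetched
    # crude link extraction: capture each href and its link text
    while ($html =~ s/<a href="(.*?)".*?>(.*?)<//is) {
        my $url = $1;
        my $headline = $2;
        if ($headline =~ /$regexp/i) {   # $regexp is defined elsewhere
            my $article;

# get the article - but HOW?

            SaveArticle($article,$headline,$url);
        }
    }
}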

Now, this works as far as it goes. But obviously it doesn't go all the way. What I want is to take $url and feed it again to LWP::Simple to pull out the text of the article which I can then parse and save. And I can't rely on using $url because it's almost certainly a relative URL.

My first try was:

while (my $page = <$papers>) {
    chomp $page;
    my $html = get $page or next;
    # directory part of the page URL, keeping the trailing slash
    (my $path = $page) =~ s{[^/]*$}{};
    while ($html =~ s/<a href="(.*?)".*?>(.*?)<//is) {
        my $url = $1;
        my $headline = $2;
        if ($headline =~ /$regexp/i) {
            $url = $path . $url;
            # parentheses matter: "get $url || 'goom'" would parse as get($url || 'goom')
            my $article = get($url) || 'goom';
            SaveArticle($article,$headline,$url);
        }
    }
}

But quite often this fails for one reason or another... perhaps because the page I'm on is on a different path from the page I want. What I'd really like is to get Perl to do exactly what I'd do, that is, click on the link, and let the page itself take care of the question whether it's a relative link and what it's relative *to*.
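To make the failure concrete, here is a minimal sketch of what prefixing the page's directory does to different kinds of link. The page URL, the hrefs and the helper name naive_join are all made up for illustration; they are not from any real site.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix the directory of the page URL onto the href,
# which is what simple concatenation amounts to.
sub naive_join {
    my ($page, $href) = @_;
    (my $path = $page) =~ s{[^/]*$}{};   # strip the file name, keep the slash
    return $path . $href;
}

# Illustrative values only.
my $page = 'http://www.example.com/news/index.html';
my @hrefs = (
    'story1.html',                          # relative to the current directory
    '/sport/story2.html',                   # relative to the site root
    'http://other.example.org/story4.html', # already absolute
);

print "$_ => ", naive_join($page, $_), "\n" for @hrefs;
```

Only the first kind of link comes out right. The root-relative link ends up under /news/, and the absolute link gets the page's directory glued onto the front of it.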

Is there a neat way to do this?

I had a look at URI::URL::full_path, but as far as I can see this would save me some time *if* I already knew what I needed to create the absolute URL, but I don't.
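One thing that looks closer to "let the page take care of it" is the new_abs constructor in the URI module, which resolves a link against a base URL the way a browser would. What follows is only a sketch, on the assumption that the URL of the page the link was found on is the right base; the URLs and the helper name absolutize are invented for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $href against the URL of the page it came from.
sub absolutize {
    my ($href, $base) = @_;
    return URI->new_abs($href, $base)->as_string;
}

# Illustrative values only.
my $base = 'http://www.example.com/news/index.html';
my @hrefs = (
    'story1.html',
    '/sport/story2.html',
    '../world/story3.html',
    'http://other.example.org/story4.html',
);

print "$_ => ", absolutize($_, $base), "\n" for @hrefs;
```

In the loop above this would mean something like $url = absolutize($url, $page) before calling get. If the page was reached through a redirect, or declares its own base href, the URL in papers.txt may not be the true base; fetching with LWP::UserAgent and asking the response for its base would probably be the safer source in that case.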

It may be that I am betraying some ignorance of how URLs work, and if so I apologise for asking a dim question and solicit enlightenment.

§ George Sherston

In reply to Handling relative urls with LWP or something else by George_Sherston
