in reply to Crawling Relative Links from Webpages

The PDF link on the example page is not relative to the current page. It starts with /, so it is an absolute path relative to the current server, and combining it with base() isn't going to work.

You need to combine it with the server root to form the correct URL, but Mech doesn't break that out for you. It will give you a URI object, but URI's documentation has always read as alien to me, so I've never been sure whether it can give you the root address of the server or not. I've always used:

my( $server ) = $url =~ m[^(http://[^/]+)];
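As a minimal, self-contained sketch of that regex in action (the sample URL and link are illustrations only, not from any real crawl):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample page URL for illustration only.
my $url = 'http://dspace.mit.edu/handle/1721.1/7582';

# Capture the scheme and host, stopping before the first / after the
# host, so $server concatenates cleanly with a /-leading link.
my( $server ) = $url =~ m[^(http://[^/]+)];

my $link   = '/bitstream/paper.pdf';    # hypothetical absolute-path link
my $pdfurl = $server . $link;

print "$pdfurl\n";    # http://dspace.mit.edu/bitstream/paper.pdf
```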

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^2: Crawling Relative Links from Webpages
by listanand (Sexton) on May 08, 2010 at 01:34 UTC
    Thanks for your reply.

    Well OK. The point is how do you determine $url? In this case, the $url is "http://dspace.mit.edu" and it is not at all obvious from the webpage (looking at the source) how one would say that this is the server. I have a million different kinds of such webpages from different servers. I need a method that is generic enough to work with all of them.

    Any suggestions anyone?

    Andy

      Something like:

      my $uri = $mech->uri;
      my( $server ) = $uri =~ m[^(http://[^/]+)];
      ...
      my $pdfurl = $server . $link;

      Note: There probably is some way of getting the appropriate portion of the url from URI without resorting to regex, but I've never worked out how.
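For what it's worth, here is a sketch of the non-regex route, assuming the URI module (which WWW::Mechanize already depends on) is available. The sample URL is an illustration; scheme plus authority gives the server root, and new_abs sidesteps the whole question by resolving the link against the page's URI directly:

```perl
use strict;
use warnings;
use URI;

# The page URI you would normally get back from $mech->uri.
my $uri = URI->new('http://dspace.mit.edu/handle/1721.1/7582');

# scheme + authority is the server root the regex was extracting.
my $server = $uri->scheme . '://' . $uri->authority;

# Or skip the manual join: new_abs resolves /-leading (and any
# other relative) links against the base URI.
my $pdfurl = URI->new_abs('/bitstream/paper.pdf', $uri);

print "$server\n";    # http://dspace.mit.edu
print "$pdfurl\n";    # http://dspace.mit.edu/bitstream/paper.pdf
```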


        uri returns a URI object, so $mech->uri->host or $mech->uri->ihost will give you the host.