listanand has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to use WWW::Mechanize to build a crawler that can crawl a (large) set of webpages and pull out all the PDFs that each webpage hosts.
I am running into trouble crawling relative links out of webpages. Some webpages have only "relative" links for certain types of files, and my crawler does not resolve them correctly.
I use Mechanize to retrieve the base URL ($mech->base()) and then prepend it to the HREF entry of the PDF, but that does not seem to work either.
I am writing the crawler for internal crawls, but here is one example webpage on the WWW that is a case in point: http://dspace.mit.edu/handle/1721.1/53720. So the question is: how do I adapt my crawler to pull PDFs from such webpages?
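Here is a stripped-down sketch of what I have been doing (the real crawler is larger; the page URL below is just the public example mentioned above, and the PDF check is simplified):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://dspace.mit.edu/handle/1721.1/53720');   # example page from above

    my $base = $mech->base();    # base URL of the page just fetched

    # Look at every link on the page and keep the ones that point at PDFs.
    for my $link ( $mech->links ) {
        my $href = $link->url;
        next unless $href =~ /\.pdf$/i;

        # This is the step that seems to go wrong for relative hrefs:
        # gluing the base onto the href by hand.
        my $url = $href =~ m{^https?://}i ? $href : "$base$href";
        print "$url\n";
    }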
Any suggestions would be gratefully appreciated.
Thank you.
Andy
PS: By the way, I tried using HTML::LinkExtor as well, and it does not work either. It does not produce the right URL for the PDF; it again prepends the "base" URL to the relative URL, just as I did manually with Mechanize above.
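And this is roughly how I tried HTML::LinkExtor (again just a sketch, using the same public example page):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get('http://dspace.mit.edu/handle/1721.1/53720');   # example page from above
    die $res->status_line unless $res->is_success;

    # Giving the parser a base URI is supposed to make it return
    # absolute URLs for relative hrefs.
    my $extor = HTML::LinkExtor->new( undef, $res->base );
    $extor->parse( $res->decoded_content );

    for my $l ( $extor->links ) {
        my ( $tag, %attr ) = @$l;
        next unless $tag eq 'a' && $attr{href} && $attr{href} =~ /\.pdf$/i;
        print $attr{href}, "\n";    # these still don't come out as working URLs for me
    }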
Replies are listed 'Best First'.

Re: Crawling Relative Links from Webpages
  by Your Mother (Archbishop) on May 08, 2010 at 02:25 UTC
  by listanand (Sexton) on May 08, 2010 at 13:41 UTC
  by Corion (Patriarch) on May 08, 2010 at 14:34 UTC
  by listanand (Sexton) on May 08, 2010 at 15:32 UTC
  by Your Mother (Archbishop) on May 08, 2010 at 17:04 UTC

Re: Crawling Relative Links from Webpages
  by BrowserUk (Patriarch) on May 08, 2010 at 00:46 UTC
  by listanand (Sexton) on May 08, 2010 at 01:34 UTC
  by BrowserUk (Patriarch) on May 08, 2010 at 01:42 UTC
  by Anonymous Monk on May 08, 2010 at 03:44 UTC
  by BrowserUk (Patriarch) on May 08, 2010 at 03:54 UTC