listanand has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to use WWW::Mechanize to build a crawler that can crawl a (large) set of webpages and pull out all the PDFs that each webpage hosts.
I am running into trouble crawling relative links out of webpages. Some webpages have only "relative" links for certain types of files, and my crawler does not resolve them correctly.
I use Mechanize to retrieve the base URL ($mech->base()) and then prepend it to the HREF entry of the PDF, but that does not seem to work either.
I am writing the crawler for internal crawls, but here is one example webpage on the WWW that is a case in point: http://dspace.mit.edu/handle/1721.1/53720. So the question is: how do I adapt my crawler to pull PDFs from such webpages?
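Here is a stripped-down sketch of what I have been doing (the real crawler is larger; the page URL below is just the public example mentioned above, and the PDF check is simplified):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://dspace.mit.edu/handle/1721.1/53720');   # example page from above

    my $base = $mech->base();    # base URL of the page just fetched

    # Look at every link on the page and keep the ones that point at PDFs.
    for my $link ( $mech->links ) {
        my $href = $link->url;
        next unless $href =~ /\.pdf$/i;

        # This is the step that seems to go wrong for relative hrefs:
        # gluing the base onto the href by hand.
        my $url = $href =~ m{^https?://}i ? $href : "$base$href";
        print "$url\n";
    }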
Any suggestions would be gratefully appreciated.
Thank you.
Andy
PS: By the way, I tried using HTML::LinkExtor as well, and it does not work either. It does not produce the right URL for the PDF; it again prepends the "base" URL to the relative URL, just as I did manually with Mechanize above.
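And this is roughly how I tried HTML::LinkExtor (again just a sketch, using the same public example page):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get('http://dspace.mit.edu/handle/1721.1/53720');   # example page from above
    die $res->status_line unless $res->is_success;

    # Giving the parser a base URI is supposed to make it return
    # absolute URLs for relative hrefs.
    my $extor = HTML::LinkExtor->new( undef, $res->base );
    $extor->parse( $res->decoded_content );

    for my $l ( $extor->links ) {
        my ( $tag, %attr ) = @$l;
        next unless $tag eq 'a' && $attr{href} && $attr{href} =~ /\.pdf$/i;
        print $attr{href}, "\n";    # these still don't come out as working URLs for me
    }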
Replies are listed 'Best First'.

Re: Crawling Relative Links from Webpages
  by Your Mother (Archbishop) on May 08, 2010 at 02:25 UTC
  by listanand (Sexton) on May 08, 2010 at 13:41 UTC
  by Corion (Patriarch) on May 08, 2010 at 14:34 UTC
  by listanand (Sexton) on May 08, 2010 at 15:32 UTC
  by Your Mother (Archbishop) on May 08, 2010 at 17:04 UTC

Re: Crawling Relative Links from Webpages
  by BrowserUk (Patriarch) on May 08, 2010 at 00:46 UTC
  by listanand (Sexton) on May 08, 2010 at 01:34 UTC
  by BrowserUk (Patriarch) on May 08, 2010 at 01:42 UTC
  by Anonymous Monk on May 08, 2010 at 03:44 UTC
  by BrowserUk (Patriarch) on May 08, 2010 at 03:54 UTC