I am trying to build a crawler with WWW::Mechanize that can crawl a (large) set of webpages and pull out all the PDFs that each page hosts.
I am running into trouble with relative links. Some pages link to certain types of files only with "relative" URLs, and my crawler does not resolve them correctly.
I use Mechanize to retrieve the base URL ($mech->base()) and then join it with the href of the PDF link, but that does not seem to work either.
I am writing the crawler for internal crawls, but here is one public example page that illustrates the problem: http://dspace.mit.edu/handle/1721.1/53720. So the question is: how do I adapt my crawler to pull PDFs from pages like this one?
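To make that concrete, here is a stripped-down sketch of the kind of thing I am doing (simplified from my real code, which walks many pages; the URL is the public example above, and the .pdf regex is just my way of spotting the PDF links):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://dspace.mit.edu/handle/1721.1/53720');

# Look at every link whose URL mentions ".pdf", however it is written in the page.
for my $link ( $mech->find_all_links( url_regex => qr/\.pdf/i ) ) {
    my $href = $link->url;    # the href exactly as it appears in the HTML

    # This is the manual step I described: glue the page's base URL onto the
    # (possibly relative) href. For the relative hrefs this gives me a URL
    # that does not actually fetch the PDF.
    my $pdf_url = $mech->base . $href;
    print "$href -> $pdf_url\n";
}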
Any suggestions would be greatly appreciated.
Thank you.
Andy
PS: By the way, I also tried HTML::LinkExtor, and it does not work either: it does not produce the "right" URL for the PDF. It too just joins the "base" URL with the relative URL, the same as I did manually with Mechanize above.
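For reference, this is roughly what my HTML::LinkExtor attempt looked like (again simplified; the page URL is the example above, and filtering on "a" tags and ".pdf" is just my guess at isolating the PDF links):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $page_url = 'http://dspace.mit.edu/handle/1721.1/53720';

my $ua   = LWP::UserAgent->new;
my $resp = $ua->get($page_url);
die $resp->status_line unless $resp->is_success;

# Passing the page URL as the base is supposed to make LinkExtor hand back
# absolute URLs, but for the PDF hrefs I still end up with the same
# base-plus-relative result I got by hand with Mechanize.
my $extor = HTML::LinkExtor->new( undef, $page_url );
$extor->parse( $resp->decoded_content );

for my $link ( $extor->links ) {
    my ( $tag, %attr ) = @$link;
    next unless $tag eq 'a' and defined $attr{href};
    print $attr{href}, "\n" if $attr{href} =~ /\.pdf/i;
}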