chuckd has asked for the wisdom of the Perl Monks concerning the following question:

I have some links like so:
http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065
At the bottom of the pages there are links to PDFs.
What's the best way to get those PDFs?
I used LWP::Simple and got the contents of the page. It works!
Can anyone give me advice on extracting the PDF links found in the page as href tags?
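
A minimal sketch of one way to do that, sticking with LWP::Simple and adding HTML::LinkExtor and URI to pull out and resolve the href targets (those two extra modules are my suggestion, not something from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get getstore);
    use HTML::LinkExtor;
    use URI;

    my $page = 'http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065';
    my $html = get($page) or die "Could not fetch $page";

    # Collect every <a href="..."> that points at a .pdf, resolved to an absolute URL
    my @pdfs;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' and $attr{href};
        my $abs = URI->new_abs($attr{href}, $page);
        push @pdfs, $abs if $abs->path =~ /\.pdf$/i;
    });
    $extor->parse($html);

    # Save each PDF under its own file name, pausing between requests
    for my $pdf (@pdfs) {
        (my $file = $pdf->path) =~ s{.*/}{};
        getstore($pdf, $file);
        sleep 10;
    }

Note that, as ig points out below, the site's robots.txt disallows the PDF directory, so make sure you have permission before running something like this.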

Replies are listed 'Best First'.
Re: web scrap question
by ig (Vicar) on Jul 28, 2009 at 21:44 UTC

    wget will do what you want.

    The site's robots.txt file lists the directory containing the pdf files as disallowed. Unless you have permission from the owner, you should respect the robots.txt file. The wget utility does this by default but it can be forced to ignore robots.txt if appropriate.

    wget with the following options will download the files:

    wget -r -l 1 -A .pdf -w 10 -e robots=off 'http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065'
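
    If you want to stay in Perl and still honor robots.txt automatically, LWP::RobotUA does that for you. A rough sketch (the agent name and e-mail address are placeholders):

    use strict;
    use warnings;
    use LWP::RobotUA;

    # LWP::RobotUA fetches and obeys the site's robots.txt for you
    my $ua = LWP::RobotUA->new('my-990-fetcher/0.1', 'you@example.com');
    $ua->delay(10/60);    # wait 10 seconds between requests (delay is in minutes)

    my $res = $ua->get('http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065');
    if ($res->is_success) {
        print $res->decoded_content;
    }
    else {
        # A disallowed URL comes back as "403 Forbidden by robots.txt"
        die $res->status_line;
    }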
Re: web scrap question
by Your Mother (Archbishop) on Jul 28, 2009 at 22:14 UTC

    And as a lesson in how *not* to write a CGI, try hitting the page without an id (or rather, probably don't): http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi. It prints itself in what I'm guessing is an endless loop. I got a couple hundred copies before I realized what was going on and closed the tab.

    If you know the owners of that site or are feeling charitable you might let them know. It's a pretty bad bug. Self-inflicted DoS.

Re: web scrap question
by mzedeler (Pilgrim) on Jul 28, 2009 at 21:10 UTC
Re: web scrap question
by jrsimmon (Hermit) on Jul 28, 2009 at 20:54 UTC
    Have you tried anything at all? Perhaps looked at WWW::Curl?
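
    For reference, the fetch side with WWW::Curl looks roughly like this (adapted from the module's synopsis; error handling is minimal, and you would still need to parse the page for the .pdf links afterwards):

    use strict;
    use warnings;
    use WWW::Curl::Easy;

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_URL, 'http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065');

    # Capture the response body in a scalar instead of printing it to STDOUT
    my $body;
    open(my $fh, '>', \$body);
    $curl->setopt(CURLOPT_WRITEDATA, $fh);

    my $retcode = $curl->perform;
    if ($retcode == 0) {
        print "Fetched ", length($body), " bytes\n";
    }
    else {
        die "curl error $retcode: " . $curl->strerror($retcode);
    }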
Re: web scrap question
by whakka (Hermit) on Jul 29, 2009 at 00:35 UTC
    This example from the WWW::Mechanize documentation seems to be exactly what you want to do.
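
    For anyone landing here later, a rough sketch along those lines with WWW::Mechanize (not a verbatim copy of the documentation example; find_all_links and the :content_file option are standard Mechanize/LWP features):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    $mech->get('http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065');

    # Grab every link whose URL ends in .pdf and save it to the current directory
    for my $link ($mech->find_all_links(url_regex => qr/\.pdf$/i)) {
        my $url = $link->url_abs;
        (my $file = $url->path) =~ s{.*/}{};
        $mech->get($url, ':content_file' => $file);
        sleep 10;    # be polite to the server
    }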