chuckd has asked for the wisdom of the Perl Monks concerning the following question:

I have some links like so:
http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065
At the bottom of the pages there are links to PDFs.
What's the best way to get those PDFs?
I used LWP::Simple and got the contents of the page. It works!
Can anyone give me advice on extracting the PDF links found in the page as href tags?
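
A minimal sketch of one way to do that, sticking with LWP::Simple and adding HTML::LinkExtor and URI to pull out and resolve the href targets (those two extra modules are my suggestion, not something from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get getstore);
    use HTML::LinkExtor;
    use URI;

    my $page = 'http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065';
    my $html = get($page) or die "Could not fetch $page";

    # Collect every <a href="..."> that points at a .pdf, resolved to an absolute URL
    my @pdfs;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' and $attr{href};
        my $abs = URI->new_abs($attr{href}, $page);
        push @pdfs, $abs if $abs->path =~ /\.pdf$/i;
    });
    $extor->parse($html);

    # Save each PDF under its own file name, pausing between requests
    for my $pdf (@pdfs) {
        (my $file = $pdf->path) =~ s{.*/}{};
        getstore($pdf, $file);
        sleep 10;
    }

Note that, as ig points out below, the site's robots.txt disallows the PDF directory, so make sure you have permission before running something like this.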

Replies are listed 'Best First'.
Re: web scrap question
by ig (Vicar) on Jul 28, 2009 at 21:44 UTC

    wget will do what you want.

    The site's robots.txt file lists the directory containing the pdf files as disallowed. Unless you have permission from the owner, you should respect the robots.txt file. The wget utility does this by default but it can be forced to ignore robots.txt if appropriate.

    wget with the following options will download the files:

    wget -r -l 1 -A .pdf -w 10 -e robots=off 'http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065'
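
    If you want to stay in Perl and still honor robots.txt automatically, LWP::RobotUA does that for you. A rough sketch (the agent name and e-mail address are placeholders):

    use strict;
    use warnings;
    use LWP::RobotUA;

    # LWP::RobotUA fetches and obeys the site's robots.txt for you
    my $ua = LWP::RobotUA->new('my-990-fetcher/0.1', 'you@example.com');
    $ua->delay(10/60);    # wait 10 seconds between requests (delay is in minutes)

    my $res = $ua->get('http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065');
    if ($res->is_success) {
        print $res->decoded_content;
    }
    else {
        # A disallowed URL comes back as "403 Forbidden by robots.txt"
        die $res->status_line;
    }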
Re: web scrap question
by Your Mother (Archbishop) on Jul 28, 2009 at 22:14 UTC

    And as a lesson in how *not* to write a CGI, try hitting the page without an id (or rather, probably don't): http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi. It prints itself in what I'm guessing is an endless loop. I got a couple hundred copies before I realized what was going on and closed the tab.

    If you know the owners of that site or are feeling charitable you might let them know. It's a pretty bad bug. Self-inflicted DoS.

Re: web scrap question
by mzedeler (Pilgrim) on Jul 28, 2009 at 21:10 UTC
Re: web scrap question
by jrsimmon (Hermit) on Jul 28, 2009 at 20:54 UTC
    Have you tried anything at all? Perhaps looked at WWW::Curl?
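
    For reference, the fetch side with WWW::Curl looks roughly like this (adapted from the module's synopsis; error handling is minimal, and you would still need to parse the page for the .pdf links afterwards):

    use strict;
    use warnings;
    use WWW::Curl::Easy;

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_URL, 'http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065');

    # Capture the response body in a scalar instead of printing it to STDOUT
    my $body;
    open(my $fh, '>', \$body);
    $curl->setopt(CURLOPT_WRITEDATA, $fh);

    my $retcode = $curl->perform;
    if ($retcode == 0) {
        print "Fetched ", length($body), " bytes\n";
    }
    else {
        die "curl error $retcode: " . $curl->strerror($retcode);
    }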
Re: web scrap question
by whakka (Hermit) on Jul 29, 2009 at 00:35 UTC
    This example from the WWW::Mechanize documentation seems to be exactly what you want to do.
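
    For anyone landing here later, a rough sketch along those lines with WWW::Mechanize (not a verbatim copy of the documentation example; find_all_links and the :content_file option are standard Mechanize/LWP features):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    $mech->get('http://dynamodata.fdncenter.org/990s/990search/ffindershow.cgi?id=MITC065');

    # Grab every link whose URL ends in .pdf and save it to the current directory
    for my $link ($mech->find_all_links(url_regex => qr/\.pdf$/i)) {
        my $url = $link->url_abs;
        (my $file = $url->path) =~ s{.*/}{};
        $mech->get($url, ':content_file' => $file);
        sleep 10;    # be polite to the server
    }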