Whitchman has asked for the wisdom of the Perl Monks concerning the following question:
Is there any way (stupid question, there's always a way) to have a Perl script start at one URL, follow the links on each page, and download certain files (like JPEGs over 16 KB)? I need it to be very specific about which pages it fetches, such as how many links away from the original URL it will go. I also need it to sort the downloaded files into directories with the same structure they came from. Example: if a file came from "original_url/images/set1/image6.jpg" I would want it saved to something like "C:/images/set1/image6.jpg", and the same for all the images. Get it?
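[Editor's note: a minimal sketch of just the directory-mirroring part, using URI, File::Spec and File::Path; the base URL and local root below are placeholders, not anything from the thread.]

use strict;
use warnings;
use URI;
use File::Spec;
use File::Path qw(mkpath);
use File::Basename qw(dirname);

# Placeholders -- substitute the real site and the real local root.
my $base       = 'http://example.com/';
my $local_root = 'C:/';

# Map a fetched URL onto a local path that mirrors the remote layout.
sub local_path_for {
    my ($url) = @_;
    my $rel  = URI->new($url)->rel($base);                      # e.g. images/set1/image6.jpg
    my $path = File::Spec->catfile($local_root, split m{/}, $rel);
    mkpath(dirname($path));                                      # create images/set1/ as needed
    return $path;
}

print local_path_for('http://example.com/images/set1/image6.jpg'), "\n";
# prints something like C:/images/set1/image6.jpg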
Re: Spidering websites
by tachyon (Chancellor) on Apr 09, 2002 at 03:28 UTC
Link Checker is a web spider script, and at this node the illustrious merlyn adds links to four spiders of his own.
You should have little trouble modifying these scripts.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Spidering websites
by Chmrr (Vicar) on Apr 09, 2002 at 04:23 UTC
If you're looking at writing a spider, the WWW::Robot module should get some airtime. I used it with great success a while back to slurp the contents of a rather large and complex zoo of static pages into a dynamic engine. Especially cool in my eyes because it uses HTML::TreeBuilder, which I also happen to like.
perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
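[Editor's note: for illustration, a small sketch of the link-extraction step using HTML::TreeBuilder directly (not WWW::Robot itself); the start URL is a placeholder.]

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI;

my $start = 'http://example.com/';              # placeholder start URL
my $ua    = LWP::UserAgent->new;
my $res   = $ua->get($start);
die "GET $start failed: ", $res->status_line unless $res->is_success;

# Parse the page and collect absolute URLs from every <a href="...">.
my $tree  = HTML::TreeBuilder->new_from_content($res->content);
my @links = map  { URI->new_abs($_->attr('href'), $res->base) }
            grep { defined $_->attr('href') }
            $tree->look_down(_tag => 'a');
$tree->delete;                                   # free the parse tree

print "$_\n" for @links;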
Re: Spidering websites
by CukiMnstr (Deacon) on Apr 09, 2002 at 03:00 UTC
...and if you *really* want to do it in Perl, you can check out LWP::RobotUA (and the other LWP:: modules), then do a quick search here in the Monastery to find some scripts that might guide you.
hope this helps,
Update: Changed LWP::UserAgent to LWP::RobotUA, thanks belg4mit.
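[Editor's note: a rough sketch of how these pieces might fit together with LWP::RobotUA and HTML::LinkExtor, filtering for JPEGs over 16 KB, limiting crawl depth, and mirroring the remote directory layout. The bot name, contact address, start URL, and local root are all placeholders.]

use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use URI;
use File::Path qw(mkpath);
use File::Basename qw(dirname);

my $start      = 'http://example.com/images/';   # placeholder start URL
my $local_root = 'C:/images';                     # placeholder local root
my $max_depth  = 2;                               # how many links away from $start to follow
my $min_size   = 16 * 1024;                       # only keep JPEGs over 16 KB

my $ua = LWP::RobotUA->new('example-mirror/0.1', 'you@example.com');
$ua->delay(1/60);                                 # delay is in minutes: wait one second between requests

my %seen;
my @queue = ([$start, 0]);

while (my $item = shift @queue) {
    my ($url, $depth) = @$item;
    next if $seen{$url}++ or $depth > $max_depth;

    my $res = $ua->get($url);
    next unless $res->is_success;

    if ($res->content_type eq 'image/jpeg') {
        next if length($res->content) < $min_size;
        my $rel  = URI->new($url)->rel($start);   # e.g. set1/image6.jpg
        my $path = "$local_root/$rel";
        mkpath(dirname($path));
        open my $fh, '>', $path or next;
        binmode $fh;                               # images are binary data
        print {$fh} $res->content;
        close $fh;
    }
    elsif ($res->content_type eq 'text/html') {
        # Collect absolute links (href and src) and queue the ones under $start.
        my $p = HTML::LinkExtor->new(undef, $res->base);
        $p->parse($res->content);
        $p->eof;
        for my $link ($p->links) {
            my ($tag, %attr) = @$link;
            my $next = $attr{href} || $attr{src} or next;
            next unless $next =~ /^\Q$start\E/;   # stay under the original URL
            push @queue, ["$next", $depth + 1];
        }
    }
}

[Editor's note continued: LWP::RobotUA also honours robots.txt automatically, which is why the replies above suggest it over plain LWP::UserAgent.]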
Re: Spidering websites
by premchai21 (Curate) on Apr 09, 2002 at 02:53 UTC