Spidering websites

by Whitchman (Novice)
on Apr 09, 2002 at 02:38 UTC ( [id://157635] )

Whitchman has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way (stupid question, there's always a way) to have a Perl script start at one URL, follow the links on the page, and download certain files (like JPEGs over 16 KB)? I need to be able to control exactly which pages it fetches, such as how far from the original URL it will go. I also need it to sort the downloaded files into directories with the same structure they came from. Example: if a file came from "original_url/images/set1/image6.jpg" I would want it to go to something like "C:/images/set1/image6.jpg", and do that for all the images. Get it?
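
A rough, untested sketch of one way this might look, using LWP::UserAgent and HTML::LinkExtor. The starting URL, the depth limit, and checking the image size after downloading it (rather than with a HEAD request first) are all illustrative assumptions, not anything prescribed in the thread:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;
    use File::Basename qw(dirname);
    use File::Path qw(mkpath);

    my $start     = URI->new('http://original_url/');  # starting page (placeholder)
    my $max_depth = 2;            # how many links away from the start page to go
    my $min_size  = 16 * 1024;    # only keep jpegs bigger than 16 KB
    my $save_root = 'C:';         # local root; paths under it mirror the site

    my $ua = LWP::UserAgent->new( agent => 'my-spider/0.1' );
    my %seen;

    crawl( $start, 0 );

    sub crawl {
        my ( $url, $depth ) = @_;
        return if $depth > $max_depth or $seen{$url}++;

        my $res = $ua->get($url);
        return unless $res->is_success;

        # if the URL itself is a jpeg, save it and don't try to parse it
        if ( $url->path =~ /\.jpe?g$/i ) {
            save_image( $url, $res );
            return;
        }
        return unless $res->content_type eq 'text/html';

        # collect href/src links, made absolute against the page URL
        my $extor = HTML::LinkExtor->new( undef, $url );
        $extor->parse( $res->decoded_content );
        for my $link ( $extor->links ) {
            my ( $tag, %attr ) = @$link;
            my $next = $attr{href} || $attr{src} or next;
            $next = URI->new_abs( $next, $url );
            next unless $next->scheme =~ /^https?$/;
            next unless $next->host eq $start->host;    # stay on the original site
            crawl( $next, $depth + 1 );
        }
    }

    sub save_image {
        my ( $url, $res ) = @_;
        return unless length( $res->content ) > $min_size;
        my $file = $save_root . $url->path;  # /images/set1/image6.jpg -> C:/images/set1/image6.jpg
        mkpath( dirname($file) );
        open my $fh, '>', $file or die "can't write $file: $!";
        binmode $fh;
        print {$fh} $res->content;
        close $fh;
    }

Checking Content-Length with a HEAD request first would avoid downloading jpegs that turn out to be under 16 KB, and swapping LWP::UserAgent for LWP::RobotUA (mentioned in the replies below) would make the spider honour robots.txt.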

Replies are listed 'Best First'.
Re: Spidering websites
by tachyon (Chancellor) on Apr 09, 2002 at 03:28 UTC

    Link Checker is a web spider script and at this node the illustrious merlyn adds links to 4 spiders of his own. You should have little trouble modifying these scripts.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Spidering websites
by Chmrr (Vicar) on Apr 09, 2002 at 04:23 UTC

    If you're looking at writing a spider, the WWW::Robot module should get some airtime. I used it with great success a while back to slurp the contents of a rather large and complex zoo of static pages into a dynamic engine. Especially cool in my eyes because it uses HTML::TreeBuilder, which I also happen to like. (A bare-bones sketch of wiring it up follows below.)

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
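
    A bare-bones sketch of how WWW::Robot might be wired up for this. The constructor attributes, hook names, and callback arguments below are from memory rather than checked against the module's documentation, so treat them as assumptions:

        use strict;
        use warnings;
        use WWW::Robot;

        # NOTE: attribute and hook names are recollections of the interface;
        # verify them against the WWW::Robot docs before relying on them.
        my $robot = WWW::Robot->new(
            NAME    => 'image-grabber',
            VERSION => '0.01',
            EMAIL   => 'you@example.com',
        );

        # decide whether the robot should follow a URL it has discovered
        $robot->addHook( 'follow-url-test', sub {
            my ( $robot, $hook, $url ) = @_;
            return $url =~ m{^\Qhttp://original_url/\E};   # stay under the start site
        } );

        # called for each page fetched; $structure is an HTML::Element tree,
        # which is where HTML::TreeBuilder comes in
        $robot->addHook( 'invoke-on-contents', sub {
            my ( $robot, $hook, $url, $response, $structure ) = @_;
            # walk $structure here, e.g. $structure->look_down( _tag => 'img' ),
            # and save the images worth keeping
        } );

        $robot->run('http://original_url/');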

Re: Spidering websites
by CukiMnstr (Deacon) on Apr 09, 2002 at 03:00 UTC
    ...and if you *really* want to do it in Perl, you can check out LWP::RobotUA (and the other LWP:: modules), and then do a quick search here in the monastery for scripts that might guide you (see the short example below).

    hope this helps,

    Update: Changed LWP::UserAgent to LWP::RobotUA, thanks belg4mit.
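
    A short example of the practical difference: LWP::RobotUA is a drop-in replacement for LWP::UserAgent that fetches and obeys robots.txt and throttles requests to each host. The contact address and URL below are placeholders:

        use strict;
        use warnings;
        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new(
            agent => 'image-grabber/0.01',
            from  => 'you@example.com',   # a real contact address is expected here
        );
        $ua->delay( 1/60 );               # delay() is in minutes; wait ~1 second between requests

        my $res = $ua->get('http://original_url/images/set1/image6.jpg');
        print $res->is_success
            ? length( $res->content ) . " bytes\n"
            : $res->status_line . "\n";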

      Umm, better than that: LWP::RobotUA. Behave yourself.

      --
      perl -pe "s/\b;([mnst])/'\1/mg"

Re: Spidering websites
by premchai21 (Curate) on Apr 09, 2002 at 02:53 UTC
