Spidering websites

by Whitchman (Novice)
on Apr 09, 2002 at 02:38 UTC ( [id://157635] )

Whitchman has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way (stupid question, there's always a way) to have a Perl script start at one URL, follow the links on the page, and download certain files (like JPEGs over 16 KB)? I need to be able to control exactly which pages it fetches, such as how far from the original URL it will go. I also need it to sort the downloaded files into directories with the same structure they came from. Example: if a file came from "original_url/images/set1/image6.jpg" I would want it to go to something like "C:/images/set1/image6.jpg", and do that for all the images. Get it?
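
A rough, untested sketch of one way this might look, using LWP::UserAgent and HTML::LinkExtor. The starting URL, the depth limit, and checking the image size after downloading it (rather than with a HEAD request first) are all illustrative assumptions, not anything prescribed in the thread:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;
    use File::Basename qw(dirname);
    use File::Path qw(mkpath);

    my $start     = URI->new('http://original_url/');  # starting page (placeholder)
    my $max_depth = 2;            # how many links away from the start page to go
    my $min_size  = 16 * 1024;    # only keep jpegs bigger than 16 KB
    my $save_root = 'C:';         # local root; paths under it mirror the site

    my $ua = LWP::UserAgent->new( agent => 'my-spider/0.1' );
    my %seen;

    crawl( $start, 0 );

    sub crawl {
        my ( $url, $depth ) = @_;
        return if $depth > $max_depth or $seen{$url}++;

        my $res = $ua->get($url);
        return unless $res->is_success;

        # if the URL itself is a jpeg, save it and don't try to parse it
        if ( $url->path =~ /\.jpe?g$/i ) {
            save_image( $url, $res );
            return;
        }
        return unless $res->content_type eq 'text/html';

        # collect href/src links, made absolute against the page URL
        my $extor = HTML::LinkExtor->new( undef, $url );
        $extor->parse( $res->decoded_content );
        for my $link ( $extor->links ) {
            my ( $tag, %attr ) = @$link;
            my $next = $attr{href} || $attr{src} or next;
            $next = URI->new_abs( $next, $url );
            next unless $next->scheme =~ /^https?$/;
            next unless $next->host eq $start->host;    # stay on the original site
            crawl( $next, $depth + 1 );
        }
    }

    sub save_image {
        my ( $url, $res ) = @_;
        return unless length( $res->content ) > $min_size;
        my $file = $save_root . $url->path;  # /images/set1/image6.jpg -> C:/images/set1/image6.jpg
        mkpath( dirname($file) );
        open my $fh, '>', $file or die "can't write $file: $!";
        binmode $fh;
        print {$fh} $res->content;
        close $fh;
    }

Checking Content-Length with a HEAD request first would avoid downloading jpegs that turn out to be under 16 KB, and swapping LWP::UserAgent for LWP::RobotUA (mentioned in the replies below) would make the spider honour robots.txt.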

Replies are listed 'Best First'.
Re: Spidering websites
by tachyon (Chancellor) on Apr 09, 2002 at 03:28 UTC

    Link Checker is a web spider script and at this node the illustrious merlyn adds links to 4 spiders of his own. You should have little trouble modifying these scripts.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Spidering websites
by Chmrr (Vicar) on Apr 09, 2002 at 04:23 UTC

    If you're looking at writing a spider, the WWW::Robot module should get some airtime. I used it with great success a while back to slurp the contents of a rather large and complex zoo of static pages into a dynamic engine. Especially cool in my eyes because it uses HTML::TreeBuilder, which I also happen to like. (A bare-bones sketch of wiring it up follows below.)

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
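
    A bare-bones sketch of how WWW::Robot might be wired up for this. The constructor attributes, hook names, and callback arguments below are from memory rather than checked against the module's documentation, so treat them as assumptions:

        use strict;
        use warnings;
        use WWW::Robot;

        # NOTE: attribute and hook names are recollections of the interface;
        # verify them against the WWW::Robot docs before relying on them.
        my $robot = WWW::Robot->new(
            NAME    => 'image-grabber',
            VERSION => '0.01',
            EMAIL   => 'you@example.com',
        );

        # decide whether the robot should follow a URL it has discovered
        $robot->addHook( 'follow-url-test', sub {
            my ( $robot, $hook, $url ) = @_;
            return $url =~ m{^\Qhttp://original_url/\E};   # stay under the start site
        } );

        # called for each page fetched; $structure is an HTML::Element tree,
        # which is where HTML::TreeBuilder comes in
        $robot->addHook( 'invoke-on-contents', sub {
            my ( $robot, $hook, $url, $response, $structure ) = @_;
            # walk $structure here, e.g. $structure->look_down( _tag => 'img' ),
            # and save the images worth keeping
        } );

        $robot->run('http://original_url/');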

Re: Spidering websites
by CukiMnstr (Deacon) on Apr 09, 2002 at 03:00 UTC
    ...and if you *really* want to do it in Perl, you can check out LWP::RobotUA (and the other LWP:: modules), and then do a quick search here in the monastery for scripts that might guide you (see the short example below).

    hope this helps,

    Update: Changed LWP::UserAgent to LWP::RobotUA, thanks belg4mit.
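
    A short example of the practical difference: LWP::RobotUA is a drop-in replacement for LWP::UserAgent that fetches and obeys robots.txt and throttles requests to each host. The contact address and URL below are placeholders:

        use strict;
        use warnings;
        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new(
            agent => 'image-grabber/0.01',
            from  => 'you@example.com',   # a real contact address is expected here
        );
        $ua->delay( 1/60 );               # delay() is in minutes; wait ~1 second between requests

        my $res = $ua->get('http://original_url/images/set1/image6.jpg');
        print $res->is_success
            ? length( $res->content ) . " bytes\n"
            : $res->status_line . "\n";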

      Umm, better than that: LWP::RobotUA. Behave yourself.

      --
      perl -pe "s/\b;([mnst])/'\1/mg"

Re: Spidering websites
by premchai21 (Curate) on Apr 09, 2002 at 02:53 UTC
