Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
What I'm trying to do is feed in a $url and have the script scan that page for links. Then, for every link it finds, it should follow it and keep branching out until every page it can reach has been scanned.
I could probably manage to use LWP::Simple or maybe LWP::UserAgent to scrape the main page for links, and possibly do the same for the first link after that. But when it comes to branching out and following every link from every page (like a tree of links), I have NO idea where to begin.
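Here is a rough sketch of the kind of loop I'm picturing, using LWP::UserAgent and HTML::LinkExtor with a %seen hash and a work queue (the start URL and the user-agent string below are just placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $start = 'http://www.example.com/';     # placeholder start page
my $ua    = LWP::UserAgent->new( agent => 'my-crawler/0.1' );

my %seen;                  # every URL we have already fetched
my @queue = ($start);      # URLs still waiting to be fetched

while ( my $url = shift @queue ) {
    next if $seen{$url}++;

    my $response = $ua->get($url);
    next unless $response->is_success;
    next unless $response->content_type eq 'text/html';

    # Giving HTML::LinkExtor the response base makes every
    # extracted link come back as an absolute URI object.
    my $extor = HTML::LinkExtor->new( undef, $response->base );
    $extor->parse( $response->decoded_content );

    for my $link ( $extor->links ) {
        my ( $tag, %attr ) = @$link;
        next unless $tag eq 'a' and $attr{href};

        my $abs = $attr{href};                     # absolute URI, thanks to the base
        next unless $abs->scheme =~ /^https?$/;    # skip mailto:, javascript:, ...
        $abs->fragment(undef);                     # treat page#foo and page as the same page
        push @queue, $abs->as_string;
    }
}

print "$_\n" for sort keys %seen;
```

In real life I'd presumably also want to compare each link's host against the host of $start so it stays on one site, and be polite about robots.txt (LWP::RobotUA), but the %seen hash plus the queue is the basic idea I have so far.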
And using regexes would probably be a pain, because webmasters don't always use FULL URLs like they should, and then you also have to handle relative and dynamic URLs like /cgi-bin/script.cgi?param=12&name=test .
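For the partial hrefs themselves, it looks like the URI module will resolve them against the page they were found on, so at least that part shouldn't need a regex. A quick test (the base URL is just an example):

```perl
use strict;
use warnings;
use URI;

# Pretend this is the page the links were scraped from.
my $base = 'http://www.example.com/cgi-bin/script.cgi?param=12&name=test';

for my $href ( '/images/pic.gif', 'page2.html', '../docs/faq.html', 'script.cgi?param=13' ) {
    my $abs = URI->new_abs( $href, $base );    # resolve relative to $base
    print "$href  =>  $abs\n";
}
```

That prints the fully qualified versions, e.g. /images/pic.gif becomes http://www.example.com/images/pic.gif, so relative and query-string links end up looking the same as any other absolute URL.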
Are there modules to do this type of thing I could use? Any advice would be much appreciated.
Replies are listed 'Best First'.
Re: Creating a web crawler (theory) by brian_d_foy (Abbot) on Jan 28, 2005 at 17:30 UTC
Re: Creating a web crawler (theory) by hardburn (Abbot) on Jan 28, 2005 at 17:42 UTC
Re: Creating a web crawler (theory) by Fletch (Bishop) on Jan 28, 2005 at 18:07 UTC
Re: Creating a web crawler (theory) by Zaxo (Archbishop) on Jan 28, 2005 at 17:37 UTC
Re: Creating a web crawler (theory) by gaal (Parson) on Jan 29, 2005 at 14:14 UTC
Re: Creating a web crawler (theory) by ww (Archbishop) on Jan 28, 2005 at 20:10 UTC
    by brian_d_foy (Abbot) on Jan 28, 2005 at 21:01 UTC
    by gaal (Parson) on Jan 29, 2005 at 14:10 UTC
    by brian_d_foy (Abbot) on Jan 29, 2005 at 15:42 UTC
    by gaal (Parson) on Jan 29, 2005 at 16:26 UTC