Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
How would you go about finding all the links on one page, then all the links on each of those pages, and so on, branching out until every link has been exhausted? I'm thinking this probably has to be done with a hash to track whether a particular link has already been scanned.
My other question is: when something like this is done, is the processing or data collection done DURING this initial crawl? Or do we typically record all the possible links first, then use LWP::Simple to load each of the pages and do whatever we need to them?
Example code would be better than just posting a link to Some::Mod on CPAN. If this can't be done easily without other modules, that may work too, but I'd rather not use anything Perl didn't come with.
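Here is roughly the queue-plus-%seen approach I have in mind (a minimal, untested sketch; the start URL is a placeholder, the stay-on-site check is just a prefix match, and it assumes the URI module that ships alongside LWP is available):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI;

    # Placeholder starting point -- replace with the site to crawl.
    my $start = 'http://www.example.com/';

    my %seen;                      # URLs already fetched or queued
    my @queue = ($start);
    $seen{$start} = 1;

    while (my $url = shift @queue) {
        my $html = get($url);
        next unless defined $html;

        # ... do whatever processing is needed on $html here ...

        # Naive href extraction; an HTML parser (e.g. HTML::LinkExtor)
        # is more robust, but this keeps the sketch self-contained.
        while ($html =~ /href\s*=\s*["']([^"']+)["']/gi) {
            my $link = URI->new_abs($1, $url);  # resolve relative links
            $link->fragment(undef);             # drop #anchors
            $link = $link->as_string;

            next unless $link =~ /^\Q$start\E/; # stay on the same site
            next if $seen{$link}++;             # skip already-seen URLs
            push @queue, $link;
        }
    }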
Replies are listed 'Best First'.

Re: Crawling all urls on a site
    by thedoe (Monk) on Feb 20, 2005 at 04:47 UTC
    by Cody Pendant (Prior) on Feb 20, 2005 at 11:12 UTC
Re: Crawling all urls on a site
    by gaal (Parson) on Feb 20, 2005 at 05:56 UTC
Re: Crawling all urls on a site
    by Popcorn Dave (Abbot) on Feb 20, 2005 at 03:31 UTC
Re: Crawling all urls on a site
    by Realbot (Scribe) on Feb 20, 2005 at 14:01 UTC
Re: Crawling all urls on a site
    by ambrus (Abbot) on Feb 20, 2005 at 14:31 UTC
    by Anonymous Monk on Feb 20, 2005 at 18:40 UTC
    by PodMaster (Abbot) on Feb 21, 2005 at 04:31 UTC