Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Using only modules that come prepackaged with Perl (I'd love it even more if it were just LWP::Simple and CGI), I need to start from a domain name and work through its pages gathering statistical information.

How would you go about finding all the links on one page, then all the links on the next page, and so on, continually branching out until every link has been exhausted? I'm thinking this probably has to be done via a hash, to track whether a particular link has already been scanned.

My other question is: when things of this nature are done, is the data collected DURING this initial search? Or do we typically record all the possible links first, then use LWP::Simple to load each page and do whatever we need to it?

Example code would be better than just posting a link to Some::Mod on CPAN. If this can't be done easily without other modules, that may work too, but I'd rather not use anything Perl didn't come with.

Replies are listed 'Best First'.
Re: Crawling all urls on a site
by thedoe (Monk) on Feb 20, 2005 at 04:47 UTC

    Using LWP::Simple, you can first fetch the web page like so:

    my $src = get($pageBase);

    Then you can use a regex to pull the links out of that source. You will have to check for links quoted with both ' and ". For example:

    my @pageLinks = ();
    push @pageLinks, $_ for $src =~ /<a href='([^']+)'/gs;
    push @pageLinks, $_ for $src =~ /<a href="([^"]+)"/gs;

    One thing you will have to watch out for is whether a link is relative or absolute. Unfortunately, that can make it quite difficult to check whether a link has already been visited. If someone uses .. within their links (not a good idea, I know, but it has been done), you will have to resolve that to the actual URL before adding it to your hash.
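
    If you can use URI (it is not core, but LWP itself depends on it, so wherever LWP::Simple is installed URI almost certainly is too), resolving relative links, including ones containing .., against the page they came from is straightforward. A small sketch, with a made-up base URL:

        use URI;

        my $base = 'http://www.example.com/a/b/page.html';   # page the link was found on
        for my $href ('../../index.html', '/top.html', 'other.html') {
            my $abs = URI->new_abs($href, $base);
            print "$href => $abs\n";
        }
        # ../../index.html => http://www.example.com/index.html
        # /top.html        => http://www.example.com/top.html
        # other.html       => http://www.example.com/a/b/other.html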

    However, I would recommend using one of the other modules on CPAN. They have generally been well tested and are often the most efficient ways of doing something. One that you can take a look at is WWW::SimpleRobot.

    Update: One more thing of note: if you will be crawling sites, make sure to leave a decent delay between gets. If all the links you are crawling sit on the same server, pounding it with requests will not be appreciated at all.
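
    Putting those pieces together, a minimal breadth-first crawl could look something like the sketch below. It is only a sketch: it assumes links are plain <a href="..."> tags with absolute http:// URLs, it sticks to a single host, and the starting URL is made up.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple;

        my $start  = 'http://www.example.com/';          # hypothetical starting page
        my ($host) = $start =~ m{^http://([^/]+)}i;

        my %seen  = ($start => 1);                       # URLs already queued or fetched
        my @queue = ($start);

        while (my $url = shift @queue) {
            my $src = get($url);
            next unless defined $src;

            # ... gather whatever statistics you need from $src here ...

            for my $link ($src =~ /<a\s+href=["']([^"']+)["']/gis) {
                next unless $link =~ m{^http://\Q$host\E}i;   # stay on the one site
                next if $seen{$link}++;                       # skip anything already seen
                push @queue, $link;
            }
            sleep 1;    # be polite: leave a delay between gets
        }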

      Not all links are coded <a href=""> you know...

      But as you say, first one has to get all the links, which is relatively trivial, then one has to filter them for already being visited and identical, so you'll need to not only figure out what "../../../" means at any one point, but you'll need to figure out that "../../../index.htm" is the same thing. Though of course it might be "../../../index.html" or "../../../default.htm" or something else again. Is "../../../index.htm?x=y" the same thing?
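
      One way to take some of the sting out of that, if URI is available (LWP pulls it in as a prerequisite): resolve every link against the page it was found on and canonicalise the result before using it as the hash key. That still won't tell you that index.htm is the directory default, and a different query string really is a different URL:

          use strict;
          use warnings;
          use URI;

          my %seen;
          my $base = 'http://WWW.Example.COM/a/b/c/page.html';   # page the link was found on
          my $key  = URI->new_abs('../../../index.htm?x=y', $base)->canonical->as_string;
          print "$key\n";    # http://www.example.com/index.htm?x=y
          $seen{$key} = 1;   # use the normalised form as the "visited" key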

      The short answer is, of course it can be done. But the question is, yet again, why are we trying to do it without modules?

      And by the way, what about wget?



      ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
      =~y~b-v~a-z~s; print
Re: Crawling all urls on a site
by gaal (Parson) on Feb 20, 2005 at 05:56 UTC
Re: Crawling all urls on a site
by Popcorn Dave (Abbot) on Feb 20, 2005 at 03:31 UTC
    Try searching for a web spider. That should get you going. However, you do need to pay attention to the fact that not all pages want to be crawled. A good spider will honor those requests.
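
    If you stick with LWP, the libwww-perl distribution includes LWP::RobotUA, which fetches and honours robots.txt for you and throttles its own requests; a sketch (the agent name and contact address are made up):

        use strict;
        use warnings;
        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new('StatsCrawler/0.1', 'you@example.com');
        $ua->delay(1/60);    # minimum delay between requests, in minutes

        my $resp = $ua->get('http://www.example.com/');
        if ($resp->is_success) {
            my $src = $resp->content;   # page HTML, ready for link extraction
            # ...
        }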

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Crawling all urls on a site
by Realbot (Scribe) on Feb 20, 2005 at 14:01 UTC
    Should you change your mind about the use of other modules, take a look at WWW::Mechanize also...
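
    For what it's worth, WWW::Mechanize also sidesteps the relative-link headaches discussed above: its links method returns link objects whose url_abs is already resolved against the current page. A sketch with a made-up URL:

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on HTTP errors
        $mech->get('http://www.example.com/');
        if ($mech->success) {
            # links() finds <a href>, <frame src>, and friends; url_abs()
            # resolves each one against the page it was found on.
            my @urls = map { $_->url_abs->as_string } $mech->links;
            print "$_\n" for @urls;
        }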
Re: Crawling all urls on a site
by ambrus (Abbot) on Feb 20, 2005 at 14:31 UTC
    Using only modules that come prepackaged with Perl (I'd love it even more if it were just LWP::Simple and CGI), ...

    LWP::Simple does not come prepackaged with perl.

      I have to beg to differ. I've installed Perl on a number of my Win machines and never once did I have to install that module. It's always been there.
        ...It's always been there.
        That doesn't mean that LWP::Simple is in the core (comes prepackaged with perl). It's not. See Module::CoreList.
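
        You can check with Module::CoreList (itself a CPAN install on older perls); its first_release function returns the perl version a module first shipped with, or undef if it never has:

            use Module::CoreList;

            print Module::CoreList::first_release('CGI'), "\n";   # 5.004
            print defined Module::CoreList::first_release('LWP::Simple')
                ? "in core\n"
                : "not in core\n";                                # not in core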

        MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
        I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
        ** The third rule of perl club is a statement of fact: pod is sexy.