mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a web crawler. So far I've got some logic for changing relative URLs to absolute URLs and a foreach loop to iterate through the links. It generally looks like this:
#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;

my $url = "http://www.frozenhosting.com";

# Identify the robot and supply a contact address
my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
$ua->delay(10/60);    # delay is in minutes, so this is 10 seconds

# get() returns an HTTP::Response object, not the page text
my $response = $ua->get($url);
die "Couldn't fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;

# Pull the href attributes out of the page's <a> tags
my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($response->decoded_content);
my @links = $extor->a;

print "start\n";
foreach my $link (@links) {
    # TODO: make relative links absolute and queue them for crawling
    print "$link\n";
}
Yes, I know it's missing the URL logic and a hash to store the links in, but I need a loop to tie the entire thing together. I used to open a file and enclose the whole thing in a while loop, but that was when I was mistakenly thinking that I could read from and append to the same file. Here is my past crawler attempt: Useless use of substr in void context
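For reference, here is a minimal sketch of the kind of driver loop described above, using a queue plus a %seen hash instead of a file, and URI->new_abs to turn relative links into absolute ones. The starting URL is the one from the code above; the depth limit and variable names are illustrative, not part of the original.

#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use URI;

my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
$ua->delay(10/60);    # 10 seconds between requests

my @queue = ("http://www.frozenhosting.com");   # URLs still to visit
my %seen;                                       # URLs already visited
my $max_pages = 50;                             # illustrative safety limit

while (@queue and keys %seen < $max_pages) {
    my $url = shift @queue;
    next if $seen{$url}++;          # skip anything already crawled

    my $response = $ua->get($url);
    next unless $response->is_success;

    my $extor = HTML::SimpleLinkExtor->new();
    $extor->parse($response->decoded_content);

    for my $link ($extor->a) {
        # Resolve relative links against the page they came from
        my $abs = URI->new_abs($link, $url)->canonical;
        push @queue, "$abs" unless $seen{"$abs"};
    }

    print "crawled $url, queue size ", scalar(@queue), "\n";
}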

Thanks,

Re: Writing a Web Crawler
by kvale (Monsignor) on Feb 25, 2004 at 05:23 UTC
    The module WWW::Robot implements the logic you are trying to recreate. I'd recommend using the module directly, but if you want to roll your own, try looking at that module's source code for ideas.

    -Mark

Re: Writing a Web Crawler
by perrin (Chancellor) on Feb 25, 2004 at 21:35 UTC
Re: Writing a Web Crawler
by petdance (Parson) on Feb 26, 2004 at 02:28 UTC
    WWW::Mechanize makes what you're doing nearly trivial. Please don't go to any more trouble rolling your own...

    xoxo,
    Andy

      Great, Andy. Do you have any code for WWW::Mechanize? Thanks
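
      For what it's worth, a minimal sketch of the same fetch-and-extract step with WWW::Mechanize might look like the following; the start URL is the one from the original post, and everything else (agent name, error handling) is illustrative. Note that url_abs() already resolves relative links against the current page, which removes the need for hand-rolled relative-to-absolute logic.

      #!/usr/bin/perl -w
      use strict;
      use WWW::Mechanize;

      my $mech = WWW::Mechanize->new( autocheck => 0 );
      $mech->agent("theusefulbot");

      my $url = "http://www.frozenhosting.com";
      $mech->get($url);
      die "Couldn't fetch $url: ", $mech->status, "\n" unless $mech->success;

      # links() returns WWW::Mechanize::Link objects
      for my $link ($mech->links) {
          print $link->url_abs, "\n";
      }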