mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a web crawler. So far I've got some logic for changing relative URLs to absolute URLs and a foreach loop to iterate through the links. It generally looks like this:
#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;

my $url = "http://www.frozenhosting.com";

# Identify the robot and supply a contact address
my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
$ua->delay(10/60);    # delay is in minutes, so this is 10 seconds

# get() returns an HTTP::Response object, not the page text
my $response = $ua->get($url);
die "Couldn't fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;

# Pull the href attributes out of the page's <a> tags
my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($response->decoded_content);
my @links = $extor->a;

print "start\n";
foreach my $link (@links) {
    # TODO: make relative links absolute and queue them for crawling
    print "$link\n";
}
Yes, I know it's missing the URL logic and a hash to store the links in, but I need a loop to tie the entire thing together. I used to open a file and enclose the whole thing in a while loop, but that was when I was mistakenly thinking that I could read from and append to the same file. Here is my past crawler attempt: Useless use of substr in void context
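For reference, here is a minimal sketch of the kind of driver loop described above, using a queue plus a %seen hash instead of a file, and URI->new_abs to turn relative links into absolute ones. The starting URL is the one from the code above; the depth limit and variable names are illustrative, not part of the original.

#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use URI;

my $ua = LWP::RobotUA->new("theusefulbot", "akurtis3 at yahoo.com");
$ua->delay(10/60);    # 10 seconds between requests

my @queue = ("http://www.frozenhosting.com");   # URLs still to visit
my %seen;                                       # URLs already visited
my $max_pages = 50;                             # illustrative safety limit

while (@queue and keys %seen < $max_pages) {
    my $url = shift @queue;
    next if $seen{$url}++;          # skip anything already crawled

    my $response = $ua->get($url);
    next unless $response->is_success;

    my $extor = HTML::SimpleLinkExtor->new();
    $extor->parse($response->decoded_content);

    for my $link ($extor->a) {
        # Resolve relative links against the page they came from
        my $abs = URI->new_abs($link, $url)->canonical;
        push @queue, "$abs" unless $seen{"$abs"};
    }

    print "crawled $url, queue size ", scalar(@queue), "\n";
}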

Thanks,

Re: Writing a Web Crawler
by kvale (Monsignor) on Feb 25, 2004 at 05:23 UTC
    The module WWW::Robot implements the logic you are trying to recreate. I'd recommend using the module directly, but if you want to roll your own, try looking at that module's source code for ideas.

    -Mark

Re: Writing a Web Crawler
by perrin (Chancellor) on Feb 25, 2004 at 21:35 UTC
Re: Writing a Web Crawler
by petdance (Parson) on Feb 26, 2004 at 02:28 UTC
    WWW::Mechanize makes what you're doing nearly trivial. Please don't go to any more trouble rolling your own...

    xoxo,
    Andy

      Great, Andy. Do you have any code for WWW::Mechanize? Thanks
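
      For what it's worth, a minimal sketch of the same fetch-and-extract step with WWW::Mechanize might look like the following; the start URL is the one from the original post, and everything else (agent name, error handling) is illustrative. Note that url_abs() already resolves relative links against the current page, which removes the need for hand-rolled relative-to-absolute logic.

      #!/usr/bin/perl -w
      use strict;
      use WWW::Mechanize;

      my $mech = WWW::Mechanize->new( autocheck => 0 );
      $mech->agent("theusefulbot");

      my $url = "http://www.frozenhosting.com";
      $mech->get($url);
      die "Couldn't fetch $url: ", $mech->status, "\n" unless $mech->success;

      # links() returns WWW::Mechanize::Link objects
      for my $link ($mech->links) {
          print $link->url_abs, "\n";
      }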