in reply to Creating a web crawler (theory)
It's pretty easy to do recursively:
    sub spider {
        my $page = shift;

        # Skip pages we have already visited
        return if page_already_spidered( $page );

        my $mech = get_WWW_Mechanize( $page );

        # links() returns WWW::Mechanize::Link objects; recurse on each
        # link's absolute URL
        spider( $_->url_abs ) for $mech->links;

        # Perform scraping of page

        return;
    }
LWP::UserAgent with HTML::LinkExtractor works here, but WWW::Mechanize combines both of those for you.
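The get_WWW_Mechanize() helper above is assumed rather than a real WWW::Mechanize method; a minimal sketch of it might just be a constructor plus a get(), with autocheck turned off so one bad link doesn't kill the whole crawl:

    use WWW::Mechanize;

    sub get_WWW_Mechanize {
        my $url  = shift;
        my $mech = WWW::Mechanize->new( autocheck => 0 );  # don't die on HTTP errors
        $mech->get( $url );
        return $mech;
    }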
Update: For politeness, you could add a sleep 1 at the top of the spider() subroutine. That should keep the load on the server down.
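The page_already_spidered() check is likewise left to you; one minimal sketch is a %seen hash keyed on URL, with the suggested sleep 1 dropped in at the top of spider():

    my %seen;

    sub page_already_spidered {
        my $url = shift;
        return $seen{ $url }++;    # false on the first visit, true afterwards
    }

    # and at the top of spider():
    #     sleep 1;    # politeness delay between requests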
"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.