in reply to Creating a web crawler (theory)
It's pretty easy to do recursively:
    sub spider {
        my $page = shift;

        # Skip pages we have already visited
        return if page_already_spidered( $page );

        my $mech = get_WWW_Mechanize( $page );

        # links() returns WWW::Mechanize::Link objects; recurse on each
        # link's absolute URL
        spider( $_->url_abs ) for $mech->links;

        # Perform scraping of page

        return;
    }
LWP::UserAgent with HTML::LinkExtractor works here, but WWW::Mechanize combines both of those for you.
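The get_WWW_Mechanize() helper above is assumed rather than a real WWW::Mechanize method; a minimal sketch of it might just be a constructor plus a get(), with autocheck turned off so one bad link doesn't kill the whole crawl:

    use WWW::Mechanize;

    sub get_WWW_Mechanize {
        my $url  = shift;
        my $mech = WWW::Mechanize->new( autocheck => 0 );  # don't die on HTTP errors
        $mech->get( $url );
        return $mech;
    }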
Update: For politeness, you could add a sleep 1 at the top of the spider() subroutine. That should keep the load on the server down.
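The page_already_spidered() check is likewise left to you; one minimal sketch is a %seen hash keyed on URL, with the suggested sleep 1 dropped in at the top of spider():

    my %seen;

    sub page_already_spidered {
        my $url = shift;
        return $seen{ $url }++;    # false on the first visit, true afterwards
    }

    # and at the top of spider():
    #     sleep 1;    # politeness delay between requests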
"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.