One tweak I might suggest: use a %seen_url hash to cache the URLs that have already been visited. The values are of course not important; you just want to add each URL as a key so a quick if $seen_url{$next_url} check lets you skip links you've already followed.

If you use this, then a single @queue array (push URLs from the current page onto the back, shift the next one to process off the front) will work just fine, since you'll discard anything you've already seen.
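Sketched out, the two ideas together look like this (get_links here is just a stand-in for whatever pulls the links off a page):

use strict;
use warnings;

my $start_url = shift or die "No starting URL supplied";

my %seen_url;
my @queue = ($start_url);

while (@queue) {
    my $next_url = shift @queue;        # take the next URL off the front
    next if $seen_url{$next_url};       # already followed this one: skip it
    $seen_url{$next_url}++;             # value is irrelevant; the key is the point
    push @queue, get_links($next_url);  # push this page's links onto the back
}

sub get_links { return }  # placeholder: extract and return a page's links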

It might also be a good idea not to follow links that point off the site (from a "search this site" custom search box, for instance); a quick check of the host via the URI module can help with that.
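A sketch of that check, assuming $next_url is already an absolute http URL and $local_site holds the starting site's host (as in the full script below):

use URI;
my $host = URI->new($next_url)->host;   # relative URIs have no host, so resolve them first
next if $host ne $local_site;           # points off the site: skip it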

The spider can still get caught by (for instance) calendar links that are CGI-only and of which there's an effectively infinite supply. Support for trapping those is left out here.
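One crude guard, if you want it, is to cap how many URLs you'll follow for any one path, since those calendar traps tend to be a single CGI script with an endless variety of query strings (the cap of 100 here is arbitrary):

use URI;
my %path_count;
my $path = URI->new($next_url)->path;  # e.g. /cgi-bin/calendar.cgi
next if ++$path_count{$path} > 100;    # too many hits on one script: likely a trap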

I've used the URI::ImpliedBase module to handle sites that use relative links rather than absolute ones; it automatically converts relative URLs to absolute ones, based on the last absolute URL it saw. In the process of writing this script, I exposed a bug in URI::ImpliedBase that I need to fix: it changes the implied base for any absolute URI, so a mailto: breaks every relative URL that follows it. (Edit: fixed in the 0.08 release, just uploaded to CPAN; the lines that can be removed are marked in the code. URI::ImpliedBase now has an accepted_schemes list that it uses to decide whether or not to reset the base URI.)
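For example (a sketch of the behavior just described, with made-up URLs):

use URI::ImpliedBase;
my $abs = URI::ImpliedBase->new('http://example.com/docs/index.html');
my $rel = URI::ImpliedBase->new('page2.html');  # relative: resolved against the base above
print $rel->as_string, "\n";                    # http://example.com/docs/page2.html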

use strict;
use warnings;
use WWW::Mechanize;
use URI::ImpliedBase;
use URI;

my %visited;
my @queue;

my $start_url = shift or die "No starting URL supplied";
my $extractor = URI::ImpliedBase->new($start_url);
my $local_site = $extractor->host;

my $mech = WWW::Mechanize->new(autocheck => 0);

push @queue, $start_url;
while (@queue) {
    my $next_url = shift @queue;
    next unless $next_url;
    print STDERR $next_url, "\n";
    next if $visited{$next_url};

    ## Not needed with version 0.08 of URI::ImpliedBase; remove if you have it
    my $scheme_checker = URI->new($next_url);
    next if $scheme_checker->scheme and $scheme_checker->scheme !~ /http/;
    ## end of removable code

    $extractor = URI::ImpliedBase->new($next_url);
    next if $extractor->host ne $local_site;

    $mech->get($extractor->as_string);
    next unless $mech->success;

    # Unseen, on this site, and we can read it.
    # Save that we saw it, grab links from it, process this page.
    $visited{$next_url}++;
    push @queue, map { $_->url } $mech->links;
    process($next_url, $mech->content);
}

sub process {
    my ($url, $page_content) = @_;
    # Do as you like with the page content here...
    print $page_content;
}
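Save it as, say, spider.pl (any name will do) and hand it a starting URL on the command line:

perl spider.pl http://www.example.com/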
I tested this on pemungkah.com, which is heavily self-linked with relative URLs and points to a lot of external sites as well. It crawled the site quite nicely.

In reply to Re^2: Building a Spidering Application by pemungkah
in thread Building a Spidering Application by Anonymous Monk
