Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that will download a webpage and its links, but I want the application to 'spider' and then open the links. (The original web page is a list of links to webpages that are useful to me.)

Here is my code, with comments:

# Include the WWW::Mechanize module
use WWW::Mechanize;

# What URL shall we retrieve?
my $url = "http://wx.toronto.ca/festevents.nsf/all?openform";

# Create a new instance of WWW::Mechanize
# enabling autocheck checks each request to ensure it was successful,
# producing an error if not.
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Retrieve the page
$mechanize->get($url);

# Assign the page content to $page
my $page = $mechanize->content;

# Retrieve the page title
my $title = $mechanize->title;
print "<b>$title</b><br />";

# Place all of the links in an array
my @links = $mechanize->links;

# Loop through and output each link
foreach my $link (@links) {
    # Retrieve the link URL
    my $href = $link->url;

    # Retrieve the link text
    my $name = $link->text;

    print "<a href=\"$href\">$name</a>\n";
}

At the end I want to loop back and download the links in order to parse useful information. Please help! --Miriam

Replies are listed 'Best First'.
Re: Building a Spidering Application
by roboticus (Chancellor) on Jul 06, 2012 at 17:35 UTC

    I'd suggest something like:

    my @urls = ($url);
    while (@urls) {
        # Storage for links found on this pass
        my @next_set;
        for my $url (@urls) {
            # Get the page & such
            $mechanize->get($url);
            my $page  = $mechanize->content;
            my $title = $mechanize->title;
            print "<b>$title</b><br />";

            # add the page's links to the next list
            push @next_set, $mechanize->links;
        }

        # OK, we've processed all in @urls, so load @urls
        # with all the links we've found since last time,
        # and start over again.
        @urls = @next_set;
    }

    I use two arrays here because it's generally not a good idea to modify an array you're iterating on. So we basically use the second array as a bucket to hold all the links we find while processing the first array. Then, when we finish the first array, we reload it and start again.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      One tweak I might suggest: use a %seen_url hash to cache the URLs that have already been visited. The values are of course not important; you just want to add each URL as a key so you can do next if $seen_url{$next_url} to skip links you've already followed once.

      If you use this, then a single @queue array (push URLs from the current page on the back, shift next one to process off the front) will work just fine, as you'll discard anything you've already seen.

      It also might be a good idea to not follow links that point off the site (like "search this site" custom search, etc.); a quick check of the host via URI can help with that.

      This can still get caught by (for instance) calendar links that are CGI-only and of which there's an infinite supply. Adding support for trapping those is left out here.
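
      Putting those ideas together, a bare-bones version of that loop might look something like this (the variable names, the placeholder URL, and the query-string guard are only illustrative; the full script below does the real work with URI::ImpliedBase):

      use strict;
      use warnings;
      use WWW::Mechanize;
      use URI;

      my $start = "http://example.com/";                 # placeholder starting URL
      my $site  = URI->new($start)->host;                # host we want to stay on
      my $mech  = WWW::Mechanize->new(autocheck => 0);

      my %seen_url;                                      # URLs already handled
      my @queue = ($start);                              # single work queue

      while (@queue) {
          my $url = shift @queue;
          next if $seen_url{$url}++;                     # skip anything seen before

          my $u = URI->new($url);
          next unless $u->scheme && $u->scheme =~ /^https?$/;
          next unless $u->host eq $site;                 # don't wander off-site
          next if defined $u->query;                     # crude trap guard: skip
                                                         # query-string (CGI) links

          $mech->get($url);
          next unless $mech->success;

          # queue the absolute form of every link on this page
          push @queue, map { URI->new_abs( $_->url, $mech->uri )->as_string }
                           $mech->links;
      }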

      I've used the URI::ImpliedBase module to handle sites that use relative links rather than absolute ones; this module automatically converts relative URLs to absolute ones, based on the last absolute one it saw. In the process of writing this script, I exposed a bug in URI::ImpliedBase which I need to fix (it changes the implied base for any absolute URI, so a mailto: breaks every relative URL that follows it...). (Edit: fixed in the 0.08 release, just uploaded to CPAN. The lines that can be removed are marked. URI::ImpliedBase now has an accepted_schemes list that it uses to determine whether to reset the base URI or not.)

      use strict;
      use warnings;
      use WWW::Mechanize;
      use URI::ImpliedBase;
      use URI;

      my %visited;
      my @queue;

      my $start_url  = shift or die "No starting URL supplied";
      my $extractor  = URI::ImpliedBase->new($start_url);
      my $local_site = $extractor->host;

      my $mech = WWW::Mechanize->new(autocheck => 0);

      push @queue, $start_url;
      while (@queue) {
          my $next_url = shift @queue;
          next unless $next_url;

          print STDERR $next_url, "\n";
          next if $visited{$next_url};

          ## Not needed with version 0.08 of URI::ImpliedBase; remove if you have it
          my $scheme_checker = URI->new($next_url);
          next if $scheme_checker->scheme and $scheme_checker->scheme !~ /http/;
          ## end of removable code

          $extractor = URI::ImpliedBase->new($next_url);
          next if $extractor->host ne $local_site;

          $mech->get($extractor->as_string);
          next unless $mech->success;

          # Unseen, on this site, and we can read it.
          # Save that we saw it, grab links from it, process this page.
          $visited{$next_url}++;
          push @queue, map { $_->url } $mech->links;
          process($next_url, $mech->content);
      }

      sub process {
          my ($url, $page_content) = @_;

          # Do as you like with the page content here...
          print $page_content;
      }
      I tested this on pemungkah.com, which is heavily self-linked with relative URLs, and points to a lot of external sites as well. It crawled it quite nicely.

        You don't need URI::ImpliedBase. WWW::Mechanize::Link objects that Mech uses/returns have a method, url_abs, to cover this. Of course then it's up to the spider to decide if query params are relevant or duplicates or no-ops and, in the hacky world of HTML4.9, if fragments are meaningful (but only JS aware Mech would be able to care here).
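
        For example, a minimal sketch of that (the start URL is just a placeholder):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new(autocheck => 0);
        $mech->get("http://example.com/");     # any starting page

        for my $link ( $mech->links ) {
            # url_abs() resolves the href against the page's base,
            # so relative links come back as absolute URI objects.
            print $link->url_abs->as_string, "\n";
        }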

        perlmungkah:

        Yes, those are very good improvements. I wish I'd thought of them when I was originally replying!

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Building a Spidering Application
by Anonymous Monk on Jul 06, 2012 at 14:55 UTC
      Why don't you add another $mech->get($url) in your foreach loop? I would also suggest using Web::Scraper.
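
      For instance, a minimal Web::Scraper sketch (the URL is only a placeholder) that pulls every href off a page might look like:

      use strict;
      use warnings;
      use Web::Scraper;
      use URI;

      # Grab the href attribute of every <a> tag on the page.
      my $link_scraper = scraper {
          process 'a', 'links[]' => '@href';
      };

      my $result = $link_scraper->scrape( URI->new("http://example.com/") );
      for my $href ( @{ $result->{links} || [] } ) {
          print "$href\n";    # hrefs should come back resolved to absolute URIs
      }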
        but how do I make the new $URL the subsequent HTML that is downloaded?
      I have a flow chart:

      I download the original webpage and print out the content as well as all the links. I then want to get($URL) each of the URLs that are linked to on the original webpage. I don't know how to write code that will get each subsequent URL.

        Have you looked at the ->get method of WWW::Mechanize? In fact, you already use it yourself when loading the initial page. Maybe consider using it also when you want to download another page.

        You can also look at the ->follow_link method in the same documentation. But note that ->follow_link will only work for following one link on a page.
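
        As a rough sketch (nothing here beyond WWW::Mechanize itself; the URL is the one from the original post), the original script could collect the links first and then call ->get on each one:

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $url  = "http://wx.toronto.ca/festevents.nsf/all?openform";
        my $mech = WWW::Mechanize->new(autocheck => 0);
        $mech->get($url);

        # Collect the links from the starting page first...
        my @links = $mech->links;

        # ...then fetch each linked page in turn and do something with it.
        for my $link (@links) {
            my $href = $link->url_abs;      # absolute form of the link
            $mech->get($href);
            next unless $mech->success;

            print "Fetched: ", ($mech->title || $href), "\n";
            # parse $mech->content here for the useful information
        }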