in reply to Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider
Below is a piece of REAL executable code in real-world Perl. It actually crawls the web (give it starting URLs on the command line). And I --'d you, sorry.
Just run it as a separate script; there is no need to "put it into" your code.
#!/usr/bin/perl -w
use strict;

use LWP::RobotUA;
use HTML::SimpleLinkExtor;

use vars qw/$http_ua/;

sub crawl {
    my @queue = @_;
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url};    # skip pages we have already fetched

        my $content = $http_ua->get($url)->content;

        # do useful things with $content here --
        # for example, save it into a file or index it.
        # I just print the url
        print qq{Downloaded: "$url"\n};

        # use a fresh extractor for each page: HTML::SimpleLinkExtor
        # keeps accumulating links across parse() calls otherwise
        my $link_extractor = HTML::SimpleLinkExtor->new;
        $link_extractor->parse($content);
        push @queue, $link_extractor->a;

        $visited{$url} = 1;
    }
}

$http_ua = LWP::RobotUA->new(theusefulbot => 'bot@theusefulnet.com');

crawl(@ARGV);
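To try it, save it as, say, crawler.pl (the file name is arbitrary) and seed it with a start URL or two:

    perl crawler.pl http://www.perlmonks.org/

The %visited hash is what cuts out previously visited pages: each URL is recorded once it has been downloaded, and the next if $visited{$url} at the top of the loop skips anything the spider has already fetched.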
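One caveat: $link_extractor->a returns href values exactly as they appear in the page, so relative links (e.g. /about.html) enter the queue in a form that get() cannot fetch. A minimal sketch of a fix, using the standard URI module that ships alongside LWP, as a replacement for the push line above:

    use URI;

    # resolve each extracted link against the page it was found on,
    # so relative hrefs become absolute, fetchable URLs
    push @queue, map { URI->new_abs($_, $url)->as_string }
                 $link_extractor->a;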