I received help from some monks a while back on building a very basic web crawler, and now I have another question. I am simply trying to keep the crawler from visiting the same web site twice. The original setup did this nicely, but after about 25,000 sites it says "Out of memory!" and stops. I have tried to make a replacement using a text file and grep; here it is (the original setup used hashes, and those lines are commented out):
#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use Storable;
use HTML::Parser 3;
use vars qw/$http_ua $link_extractor/;

my %visited;
my $visited;

sub crawl {
    #$visited = retrieve('/var/www/data/links');
    #%visited = %{$visited};
    my @queue = @_;
    my $a     = 0;
    my $base;
    while ( my $url = shift @queue ) {

        # Check the visited-links file before fetching this URL
        open( LINKS, "</var/www/data/links.txt" );
        my @visited = grep /\b$url\S\b/, <LINKS>;
        next if defined $visited[0];
        close(LINKS);

        # Fetch the page and save it to a numbered text file
        my $response = $http_ua->get($url);
        my $html     = $response->content;
        open FILE, '>' . ++$a . '.txt';
        print FILE "$url\n";
        print FILE $html;
        #print FILE body_text($html);
        close FILE;
        print qq{Downloaded: "$url"\n};

        # Queue every <a> link found on the page
        push @queue, do {
            my $link_extractor = HTML::SimpleLinkExtor->new($url);
            $link_extractor->parse($html);
            $link_extractor->a;
        };

        # Append this URL to the visited-links file
        open( LINKS, ">>/var/www/data/links.txt" );
        print LINKS $url . "\n";
        close(LINKS);
        @visited = undef;
        #$visited{$url} = 1;
    }
    #store \%visited, '/var/www/data/links';
    #%visited = undef;
}

$http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
$http_ua->delay( 10 / 6000 );

crawl(@ARGV);

sub body_text {
    my $content = $_[0] || return 'EMPTY BODY';

    # HTML::Parser is broken on Javascript and styles
    # (well it leaves it in the text) so we 'fix' it....
    my $p = HTML::Parser->new(
        start_h => [ sub { $_[0]->{text} .= ' '; $_[0]->{skip}++ if $_[1] eq 'script' or $_[1] eq 'style'; }, 'self,tag' ],
        end_h   => [ sub { $_[0]->{skip}-- if $_[1] eq '/script' or $_[1] eq '/style'; }, 'self,tag' ],
        text_h  => [ sub { $_[0]->{text} .= $_[1] unless $_[0]->{skip} }, 'self,dtext' ]
    )->parse($content);
    $p->eof();
    my $text = $p->{text};

    # remove escapes
    $text =~ s/&nbsp;/ /gi;
    $text =~ s/&[^;]+;/ /g;

    # remove non ASCII printable chars, leaves punctuation stuff
    $text =~ s/[^\040-\177]+/ /g;

    # remove any < or > in case parser choked - rare but happens
    $text =~ s/[<>]/ /g;

    # crunch whitespace
    $text =~ s/\s{2,}/ /g;
    $text =~ s/^\s+//g;
    return $text;
}
For some reason this still allows duplicates. Any ideas? Thanks.
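
Update: to show what I am aiming for, here is a minimal, self-contained sketch of just the file-based check, kept separate from the crawling code. It assumes /var/www/data/links.txt holds one URL per line, as above; the sub names seen_before and mark_visited are only for illustration and are not in the crawler itself.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: look a URL up in the visited-links file with an exact,
# line-by-line string comparison instead of an interpolated regex.
sub seen_before {
    my ( $url, $file ) = @_;
    open( my $fh, '<', $file ) or return 0;    # no file yet means nothing visited
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line eq $url ) {
            close $fh;
            return 1;
        }
    }
    close $fh;
    return 0;
}

# Sketch only: append a URL to the visited-links file, one per line.
sub mark_visited {
    my ( $url, $file ) = @_;
    open( my $fh, '>>', $file ) or die "Cannot append to $file: $!";
    print {$fh} "$url\n";
    close $fh;
}

In this sketch the comparison is a plain eq on chomped lines, so characters such as '.', '/' and '?' in the URL are not treated as pattern metacharacters.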

In reply to Using text files to remove duplicates in a web crawler by mkurtis
