mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I received help from some monks a while back on building a very basic web crawler, but I have another question. I am simply trying to keep the crawler from visiting the same web site twice. The original setup did this nicely, but after about 25,000 sites it dies with "Out of memory!" and stops. I have tried to make a replacement using a text file and grep; here it is (the original setup used hashes, which are commented out):
#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use Storable;
use HTML::Parser 3;
use vars qw/$http_ua $link_extractor/;

my %visited;
my $visited;

sub crawl {
    #$visited = retrieve('/var/www/data/links');
    #%visited = %{$visited};
    my @queue = @_;
    my $a     = 0;
    my $base;
    while ( my $url = shift @queue ) {
        open( LINKS, "</var/www/data/links.txt" );
        my @visited = grep /\b$url\S\b/, <LINKS>;
        next if defined $visited[0];
        close(LINKS);

        my $response = $http_ua->get($url);
        my $html     = $response->content;

        open FILE, '>' . ++$a . '.txt';
        print FILE "$url\n";
        print FILE $html;
        #print FILE body_text($html);
        close FILE;

        print qq{Downloaded: "$url"\n};

        push @queue, do {
            my $link_extractor = HTML::SimpleLinkExtor->new($url);
            $link_extractor->parse($html);
            $link_extractor->a;
        };

        open( LINKS, ">>/var/www/data/links.txt" );
        print LINKS $url . "\n";
        close(LINKS);
        @visited = undef;
        #$visited{$url} = 1;
    }
    #store \%visited, '/var/www/data/links';
    #%visited = undef;
}

$http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
$http_ua->delay( 10 / 6000 );

crawl(@ARGV);

sub body_text {
    my $content = $_[0] || return 'EMPTY BODY';

    # HTML::Parser is broken on Javascript and styles
    # (well it leaves it in the text) so we 'fix' it....
    my $p = HTML::Parser->new(
        start_h => [
            sub {
                $_[0]->{text} .= ' ';
                $_[0]->{skip}++ if $_[1] eq 'script' or $_[1] eq 'style';
            },
            'self,tag'
        ],
        end_h => [
            sub {
                $_[0]->{skip}-- if $_[1] eq '/script' or $_[1] eq '/style';
            },
            'self,tag'
        ],
        text_h => [
            sub { $_[0]->{text} .= $_[1] unless $_[0]->{skip} },
            'self,dtext'
        ]
    )->parse($content);
    $p->eof();
    my $text = $p->{text};

    # remove escapes
    $text =~ s/&nbsp;/ /gi;
    $text =~ s/&[^;]+;/ /g;

    # remove non ASCII printable chars, leaves punctuation stuff
    $text =~ s/[^\040-\177]+/ /g;

    # remove any < or > in case parser choked - rare but happens
    $text =~ s/[<>]/ /g;

    # crunch whitespace
    $text =~ s/\s{2,}/ /g;
    $text =~ s/^\s+//g;
    return $text;
}
For some reason this still allows duplicates. Any ideas? Thanks

Replies are listed 'Best First'.
Re: Using text files to remove duplicates in a web crawler
by PodMaster (Abbot) on Jul 07, 2004 at 03:56 UTC
    For some reason this allows duplicates, any ideas?
    How do we keep track of "duplicates" in Perl? perldoc -q duplicate says (after listing a bunch of potential solutions) perhaps you should have been using a hash all along, eh? The FAQ is right: use a hash.

    `perldoc DB_File'
    `perldoc AnyDBM_File'
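
    Here is a minimal sketch of the disk-backed hash those manpages describe, assuming DB_File is available; the file name visited.db and the @queue variable are only illustrative, not from the original post:

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # %visited lives on disk instead of in memory, so it can grow
        # far past the point where a plain in-memory hash runs out of RAM.
        tie my %visited, 'DB_File', 'visited.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie visited.db: $!";

        my @queue = @ARGV;
        while ( my $url = shift @queue ) {
            next if $visited{$url};    # skip anything already fetched
            $visited{$url} = 1;        # remember it for next time
            # ... fetch and parse $url as before ...
        }

        untie %visited;                # flush everything back to disk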

    Also, merlyn has written a couple of articles on writing spiders, so you should check those out.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Using text files to remove duplicates in a web crawler
by davido (Cardinal) on Jul 07, 2004 at 04:43 UTC

    Instead of slurping the entire %visited hash into memory at once with Storable, try a lightweight database. You could use DBI and DBD::SQLite, for example. The point is that you probably need to give up on keeping the list of every site you've visited in memory, so you need a solution that gives you quick random access to data held in more permanent storage, such as the hard drive. Flat files don't scale well, so a quick-and-dirty database would be ideal.

    Each time you visit a site, plop its URL into the database. And each time you wish to consider following another link to a site, check to see if it's already in the database or not.


    Dave
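
    Here is a minimal sketch of the approach described above, assuming DBI and DBD::SQLite are installed; the file name, table name, and first_visit helper are just illustrative:

        use strict;
        use warnings;
        use DBI;

        # One table holding every URL we've already fetched.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=visited.db', '', '',
            { RaiseError => 1, AutoCommit => 1 } );
        $dbh->do('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)');

        my $check = $dbh->prepare('SELECT 1 FROM visited WHERE url = ?');
        my $mark  = $dbh->prepare('INSERT OR IGNORE INTO visited (url) VALUES (?)');

        # Returns true the first time a URL is seen, false afterwards.
        sub first_visit {
            my ($url) = @_;
            $check->execute($url);
            my ($found) = $check->fetchrow_array;
            $check->finish;
            return 0 if $found;        # already in the database
            $mark->execute($url);
            return 1;
        }

    In the crawl loop that would become something like: next unless first_visit($url);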

      Using a DBI database will work, but a tied hash with a DB_File or similar backing will be an order of magnitude faster, as well as simpler to script.
      I wholeheartedly agree with the above, and once the information is in a database it can be accessed by lots of other applications as well. Surely a massive advantage.
Re: Using text files to remove duplicates in a web crawler
by Stevie-O (Friar) on Jul 07, 2004 at 07:40 UTC
    If you're going to be processing many thousands of URLs (25K seems like it satisfies this condition), you may want to try something called a Bloom Filter, for which a sample module already exists on CPAN (Bloom::Filter).

    A Bloom Filter is a simplified hash table that can only test for presence or absence (unlike a full-blown hash table, which can associate arbitrary data, such as a scalar, with each entry). It can encode this information in FAR less space than a normal hash table.

    Be warned! Bloom Filters are not perfect; they have a very small but nonzero rate of false positives (they never produce false negatives). By increasing the amount of storage per entry (decreasing the effective compression), however, the error rate can be reduced to quite reasonable proportions. As this article indicates, you can encode up to 1 *million* URLs and have a 1-in-1-million false positive rate while storing only 6 bytes per key. That's LESS than the 'www.' and '.com' you're almost guaranteed to find in every URL (8 bytes).
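
    Here is a minimal sketch of how that might look with the CPAN module, assuming its usual new/add/check interface; the capacity and error_rate figures below simply mirror the numbers quoted above, so check the Bloom::Filter docs for the exact parameter names:

        use strict;
        use warnings;
        use Bloom::Filter;

        # Sized for about a million URLs with a roughly
        # one-in-a-million false-positive rate.
        my $seen = Bloom::Filter->new(
            capacity   => 1_000_000,
            error_rate => 0.000_001,
        );

        my @queue = @ARGV;
        while ( my $url = shift @queue ) {
            next if $seen->check($url);    # very rarely a false positive
            $seen->add($url);
            # ... fetch and parse $url as before ...
        }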

    --Stevie-O
    $"=$,,$_=q>|\p4<6 8p<M/_|<('=> .q>.<4-KI<l|2$<6%s!<qn#F<>;$, .=pack'N*',"@{[unpack'C*',$_] }"for split/</;$_=$,,y[A-Z a-z] {}cd;print lc