This post is somewhat of a follow-up to node_id=272046. In the process of writing a spider to download images from a web site, I ended up writing a module that has more universal appeal. It's an OO module that crawls a web site and finds all the documents on it that are likely to contain HTML (it avoids multimedia files, images, and other non-HTML content).

What's good about it is that once you have this list of docs, it can serve as a master list of all the documents on the site, which can then be recrawled to find and download more specific kinds of content. I realize it's wasteful to crawl a site again for information after it's already been crawled. In the next release, I'll have the object store each crawled web page and include methods to print/recrawl them. For now, I just want to KISS and make sure I get this module off on a strong footing.
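
To give a rough idea of what I mean, once the spider has run you could walk the master list and pull down a specific kind of content, something like this (for now that means reaching into the object's _doc_list hash directly, since there's no accessor yet; see the module and usage example below):

use LWP::Simple;

# $spider is a SiteDocSpider that has already crawled the site
# (see the module and usage example below)
foreach my $doc ( keys %{ $spider->{_doc_list} } ) {
    my $html = get($doc) or next;
    # ...parse $html for <img> tags, download the images, etc.
}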

Anyway, this is my first crack at a Perl OO module, or any kind of program for that matter, that is intended for a wider audience than me. But it certainly wouldn't be a true Perl module without some feedback from my brothers and sisters at Perlmonks. I welcome all feedback, constructive or otherwise, to help make my contribution to the Perl community as good as it can be.

Thanks! Here it is:

package SiteDocSpider;

use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;
use URI;

use constant NOT_PARSED => 0;
use constant PARSED     => 1;

sub new {
    my ($class, $url) = @_;
    bless {
        #_doc_list is the module's MIP (most important variable).
        #It is the hash that contains "a list of documents most likely
        #to contain HTML on the website."
        #The keys are the URLs of the docs.
        #The value of each _doc_list hash entry is a flag.
        #When the flag is set to NOT_PARSED, it means the document needs
        #to be parsed to examine it for links to additional documents.
        #When set to PARSED, it indicates the document has already been
        #examined for links.
        _doc_list => {},

        _url => $url,

        #Don't attempt to fetch content from documents with the following
        #extensions, as they probably won't contain any HTML to examine
        #for more links.
        _avoid_ext => 'wmv|avi|rm|mpg|asf|ram|asx|mpeg|mp3|mpe|wav|jpg|jpeg|bmp|gif|png|xbm|xmp|swf|doc|xls|txt|ppt|pdf',

        #_base stores a URI object for the document about to be parsed.
        #_authority stores the authority for the URI object found in _base.
        #I put them in the object to avoid using global variables.
        #See the while loop below.
        _base      => '',
        _authority => '',
    }, $class;
}

#Method that prints out collected URLs as a 1-column list
sub print_doc_list {
    my ($self) = @_;
    my $doc_list = $self->{_doc_list};
    {
        local $, = "\n";
        print "\n\n";
        print "The following documents on $self->{_url} were found";
        print "\nby the spider. Note that this list excludes any multimedia files";
        print "\nand other files not likely to contain HTML code.\n";
        print keys %$doc_list;
    }
}

#Method that collects the names of HTML documents on a site
sub get_doc_list {
    my ($self) = @_;

    #Prepare parser object for use
    my $parser = HTML::Parser->new(api_version => 3);
    $parser->handler(start => sub { $self->_add_link(@_) }, 'tagname,attr');

    my $url      = $self->{_url};
    my $doc_list = $self->{_doc_list};

    #Set the flag that states whether more documents need parsing
    my $all_not_parsed = 1;

    #Seed hash with user-supplied URL
    $doc_list->{$url} = NOT_PARSED;

    #Main loop
    ################################################################
    while ($all_not_parsed) {
        my %temp_list = %$doc_list;
        foreach my $key (keys %temp_list) {
            if ($doc_list->{$key} == NOT_PARSED) {
                #Set the URL to the name of the next document in the hash
                $url = $key;

                #_base and _authority are global-like object variables used
                #by _add_link() and its helper subroutines to determine if a
                #link has the same domain as the document it was found on
                $self->{_base}      = URI->new($url);
                $self->{_authority} = $self->{_base}->authority;

                #The following line is a temporary hack.
                #Its weakness is that it assumes subdomains don't exist.
                $self->{_authority} =~ s/^www\.//i;

                #In a future revision, printing can be turned off by a
                #switch in the object set by the user.
                print "\nFetching: " . $key;

                #Download and examine document
                if ( my $content = get($key) ) {
                    print "\nExamining: $key";
                    $parser->parse($content);
                }
            }
            $doc_list->{$key} = PARSED;
        }

        #Assume all documents have been parsed
        $all_not_parsed = 0;

        #Search through found documents and see if any still need parsing.
        #If one is found, reset the $all_not_parsed flag.
        foreach my $key (keys %$doc_list) {
            if ($doc_list->{$key} == NOT_PARSED) {
                $all_not_parsed = 1;
                last;
            }
        }
    }
    ################################################################
}

################################################################
#HTML::Parser callback routine used to find and collect links within a document
sub _add_link {
    my ($self, $tagname, $attr) = @_;
    my $doc_list = $self->{_doc_list};

    #Find all the links in a frames doc or regular page
    if (    ( $tagname eq 'a'     && $attr->{href} )
         || ( $tagname eq 'frame' && $attr->{src} ) ) {

        my $link = $attr->{href} || $attr->{src};

        #Extract the link from javascript if it exists
        if ($link =~ /^javascript/i) {
            $link = _extract_from_javascript($link);
            return unless defined $link;
        }

        #Generate the absolute path for the link
        my $link_abs = $self->_make_absolute($link);

        #Make sure that the document linked to is likely to contain more HTML code.
        #Make sure the link hasn't already been entered into the %doc_list hash.
        #Make sure the link isn't an internal anchor tag.
        my $link_authority = _get_authority($link_abs);
        if (    $link_abs !~ /\.(?:$self->{_avoid_ext})$/i
             && !exists $doc_list->{$link_abs}
             && $link_abs !~ /#/ ) {

            if ( $link_authority eq $self->{_authority} ) {
                #The link has the same domain name as the web document it was
                #found on, so enter the absolute link path into the hash as a
                #key and set its value to NOT_PARSED (meaning it still needs
                #to be examined to see if it contains links to more documents
                #on the site).
                $doc_list->{$link_abs} = NOT_PARSED;
            }
            else {
                ; # RESERVED FOR FUTURE USE
                  # This space intended to be filled with code to collect the
                  # URLs of documents outside the domain the user has requested.
            }
        }
    }
}

################################################################
#HELPER SUBROUTINES FOR THE PARSER'S _add_link() SUBROUTINE
########################################################

#Create absolute links from relative ones.
#Note: links that are already absolute are left untouched by the URI abs() method.
sub _make_absolute {
    my ($self, $address) = @_;
    my $address_as_uri = URI->new($address);

    #Note: _base is set for each document that is parsed
    my $absolute_address = $address_as_uri->abs($self->{_base});
    return $absolute_address->as_string;
}

#Extracts a link from javascript (usually found in pop-up boxes)
sub _extract_from_javascript {
    my $javascript = shift;
    my ($address)  = $javascript =~ /["'](.*?)["']/;
    return $address;
}

#Returns the domain name and domain name extension (.com, .org) for the link.
#For example, 'http://www.nytimes.com/index' becomes 'nytimes.com'.
#This subroutine needs to be fixed to chop off subdomains as well.
#It currently only chops off a leading 'www', so 'http://graphics.nytimes.com'
#becomes 'graphics.nytimes.com', not 'nytimes.com' as it probably should.
sub _get_authority {
    my $address = shift;
    my $address_as_uri = URI->new($address);

    #Make sure it's not a mailto, ftp, etc. link
    my $scheme = $address_as_uri->scheme;
    if ($scheme && $scheme eq 'http') {
        $address = $address_as_uri->authority;
    }

    #Chop off the leading 'www', if it exists.
    #This is a temporary hack as it assumes subdomains don't exist.
    $address =~ s/^www\.//i;

    return $address;
}
########################################################

1;
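
For reference, a minimal script to drive it would look something like this (the URL is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;
use SiteDocSpider;

# Crawl the site and print the master list of HTML-ish documents
my $spider = SiteDocSpider->new('http://www.example.com');
$spider->get_doc_list;
$spider->print_doc_list;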

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff";
$nysus = $PM . $MCF;

Re: My First Real Program: YAPS (Yet Another Perl Spider)
by Coruscate (Sexton) on Jul 08, 2003 at 18:18 UTC

    Just one comment about... the comments. Comments are swell and all, but you should really relocate most of them to POD (Plain Old Documentation) format. Especially when it comes to modules, POD becomes more and more useful because it is accessible via perldoc and, more importantly, because people expect it ;)

    If you need an introduction to the POD format, Juerd (pretty) recently posted the POD in 5 minutes tutorial. It should be a great help if you are new to POD.
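
    For instance, the top of SiteDocSpider.pm might start out with something along these lines (just a rough skeleton; adjust the wording to taste):

    =head1 NAME

    SiteDocSpider - crawl a web site and collect the URLs of documents
    likely to contain HTML

    =head1 SYNOPSIS

      use SiteDocSpider;

      my $spider = SiteDocSpider->new('http://www.example.com');
      $spider->get_doc_list;
      $spider->print_doc_list;

    =head1 DESCRIPTION

    Crawls the site given to C<new> and builds a list of documents that
    probably contain HTML (multimedia files, images and the like are
    skipped).

    =cut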


    If the above content is missing any vital points or you feel that any of the information is misleading, incorrect or irrelevant, please feel free to downvote the post. At the same time, please reply to this node or /msg me to inform me as to what is wrong with the post, so that I may update the node to the best of my ability.