What's good about it is that once you have this list of docs, it can be used as a master list of all the documents on the site, which can then be recrawled to find or download more specific kinds of content. I realize it's wasteful to crawl a site a second time for information that has already been crawled. In the next release, I'll have the object store each crawled web page and include methods to print/recrawl them. For now, I just want to KISS and make sure I get this module off on a strong footing.
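To make that idea a little more concrete, here's a rough sketch of what the recrawl pass might look like (www.example.com is just a placeholder, and the script assumes the module below is saved as SiteDocSpider.pm somewhere in @INC). It reaches into the _doc_list hash directly, since the module has no accessor for it yet, and simply re-fetches each collected URL and writes it to disk:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use SiteDocSpider;

    # Build the spider and collect the document list, then re-fetch each
    # document and save a local copy.  Note that this pokes at the
    # _doc_list hash directly -- the module has no accessor for it yet.
    my $spider = SiteDocSpider->new('http://www.example.com');
    $spider->get_doc_list;

    foreach my $url ( keys %{ $spider->{_doc_list} } ) {
        my $content = get($url);
        next unless defined $content;

        # Turn the URL into a crude file name and write the page out
        ( my $file = $url ) =~ s/[^\w.-]+/_/g;
        open my $fh, '>', "$file.html" or die "Can't write $file.html: $!";
        print {$fh} $content;
        close $fh;
    }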
Anyway, this is my first crack at a Perl OO module (or, for that matter, at any kind of program intended for a wider audience than me). But it certainly wouldn't be a true Perl module without some feedback from my brothers and sisters at Perlmonks. I welcome all feedback, constructive or otherwise, to help make my contribution to the Perl community as good as it can be.
Thanks! Here it is:
package SiteDocSpider;

use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;
use URI;

use constant NOT_PARSED => 0;
use constant PARSED     => 1;

sub new {
    my ($class, $url) = @_;

    return bless {
        #_doc_list is the module's most important variable.
        #It is the hash that contains "a list of documents most likely to contain HTML on the website."
        #The keys are the URLs of the docs.
        #The value of each _doc_list hash entry is a flag.
        #When the flag is set to NOT_PARSED, the document still needs to be parsed and
        #examined for links to additional documents.
        #When set to PARSED, the document has already been examined for links.
        _doc_list => {},

        _url => $url,

        #Don't attempt to fetch content from documents with the following extensions as
        #they probably won't contain any HTML to examine for more links
        _avoid_ext => 'wmv|avi|rm|mpg|asf|ram|asx|mpeg|mp3|mpe|wav|jpg|jpeg|bmp|gif|png|xbm|xmp|swf|doc|xls|txt|ppt|pdf',

        #_base stores a URI object for the document about to be parsed.
        #_authority stores the authority of the URI object found in _base.
        #I put them in the object to avoid using global variables.
        #See the while loop below.
        _base      => '',
        _authority => '',
    }, $class;
}

#Method that prints out the collected URLs as a 1 column list
sub print_doc_list {
    my ($self) = @_;
    my $doc_list = $self->{_doc_list};
    {
        local $, = "\n";
        print "\n\n";
        print "The following documents on $self->{_url} were found";
        print "\nby the spider.  Note that this list excludes any multimedia files";
        print "\nand other files not likely to contain HTML code.\n";
        print sort keys %$doc_list;
    }
}

#Method that collects the names of HTML documents on a site
sub get_doc_list {
    my ($self) = @_;

    #Prepare parser object for use
    my $parser = HTML::Parser->new(api_version => 3);
    $parser->handler(start => sub { $self->_add_link(@_) }, 'tagname,attr');

    my $url      = $self->{_url};
    my $doc_list = $self->{_doc_list};

    #Flag that states whether more documents need parsing
    my $all_not_parsed = 1;

    #Seed the hash with the user-supplied URL
    $doc_list->{$url} = NOT_PARSED;

    #Main loop
    ################################################################
    while ($all_not_parsed) {
        my %temp_list = %$doc_list;
        foreach my $key (keys %temp_list) {
            if ($doc_list->{$key} == NOT_PARSED) {
                #Set the URL to the name of the next document in the hash
                $url = $key;

                #_base and _authority are global-like object variables used by _add_link() and its
                #helper subroutines to determine if a link has the same domain as the document it was found on
                $self->{_base}      = URI->new($url);
                $self->{_authority} = $self->{_base}->authority;

                #The following line is a temporary hack.
                #Its weakness is that it assumes subdomains don't exist.
                $self->{_authority} =~ s/^www\.//i;

                #In a future revision, printing can be turned
                #off by a switch in the object set by the user.
                print "\nFetching: " . $key;

                #Download and examine the document
                if ( my $content = get($key) ) {
                    print "\nExamining: $key";
                    $parser->parse($content);
                }
            }
            $doc_list->{$key} = PARSED;
        }

        #Assume all documents have been parsed
        $all_not_parsed = 0;

        #Search through the found documents and see if any still need parsing.
        #If one is found, reset the $all_not_parsed flag.
        foreach my $key (keys %$doc_list) {
            if ($doc_list->{$key} == NOT_PARSED) {
                $all_not_parsed = 1;
                last;
            }
        }
    }
    ################################################################

    #(Printing of the collected list is handled by the print_doc_list() method.)
}

################################################################
#HTML::Parser callback routine used to find and collect links within a document
sub _add_link {
    my ($self, $tagname, $attr) = @_;
    my $doc_list = $self->{_doc_list};

    #Find all the links in a frames doc or a regular page
    if ( ( $tagname eq 'a' && $attr->{href} ) || ( $tagname eq 'frame' && $attr->{src} ) ) {
        my $link = $attr->{href} || $attr->{src};

        #Extract the link from javascript if it exists
        if ( $link =~ /^javascript/i ) {
            $link = _extract_from_javascript($link);
        }

        #Generate the absolute path for the link
        my $link_abs = $self->_make_absolute($link);

        #Make sure that the document linked to is likely to contain more HTML code,
        #that the link hasn't already been entered into the %doc_list hash,
        #and that the link isn't an internal anchor
        my $link_authority = _get_authority($link_abs);
        if (    $link_abs !~ /\.(?:$self->{_avoid_ext})$/i
             && !exists $doc_list->{$link_abs}
             && $link_abs !~ /#/ )
        {
            if ( $link_authority eq $self->{_authority} ) {
                #The link has the same domain name as the web document it was found on,
                #so enter the absolute link path into the hash as a key and
                #set its value to NOT_PARSED (meaning it still needs to be examined to see if it
                #contains links to more documents on the site).
                $doc_list->{$link_abs} = NOT_PARSED;
            }
            else {
                ;   # RESERVED FOR FUTURE USE
                    # This space intended to be filled with code to collect the URLs
                    # of documents outside the domain the user has requested.
            }
        }
    }
}

################################################################
#HELPER SUBROUTINES FOR THE PARSER'S _add_link() SUBROUTINE
########################################################

#Create an absolute link from a relative one
#Note: links that are already absolute are left untouched by the URI abs() method.
sub _make_absolute {
    my ($self, $address) = @_;
    my $address_as_uri = URI->new($address);

    #Note: _base is set for each document that is parsed
    my $absolute_address = $address_as_uri->abs($self->{_base});
    return $absolute_address->as_string;
}

#Extracts a link from javascript (usually found in pop-up boxes)
sub _extract_from_javascript {
    my $javascript = shift;

    #Grab the first quoted string, which is usually the URL
    my ($address) = $javascript =~ /["'](.*?)["']/;
    return $address;
}

#Returns the domain name and domain name extension (.com, .org) for the link
#For example, 'http://www.nytimes.com/index' becomes 'nytimes.com'
#This subroutine needs to be fixed to chop off subdomains as well.
#It currently only chops off a leading 'www', so
#'http://graphics.nytimes.com' becomes 'graphics.nytimes.com', not 'nytimes.com' as it probably should.
sub _get_authority {
    my $address = shift;
    my $address_as_uri = URI->new($address);

    #Make sure it's not a mailto, ftp, etc. link
    my $scheme = $address_as_uri->scheme;
    if ( defined $scheme && $scheme eq 'http' ) {
        $address = $address_as_uri->authority;
    }

    #Chop off the leading 'www', if it exists
    #This is a temporary hack as it assumes subdomains don't exist
    $address =~ s/^www\.//i;

    return $address;
}

########################################################
1;
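In case anyone wants to take it for a spin, here's about the smallest driver script I can think of. Again, it assumes the package above is saved as SiteDocSpider.pm somewhere in @INC and that www.example.com stands in for a real site:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use SiteDocSpider;

    # Crawl the site starting from the URL handed to the constructor,
    # then print the one-column list of documents the spider found.
    my $spider = SiteDocSpider->new('http://www.example.com');
    $spider->get_doc_list;
    $spider->print_doc_list;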
$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff";
$nysus = $PM . $MCF;
In reply to My First Real Program: YAPS (Yet Another Perl Spider) by nysus