socrtwo has asked for the wisdom of the Perl Monks concerning the following question:
Hope everyone doesn't mind, but I'm starting a new node for what I think is now fine-tuning, comparatively speaking, next to where I started. Let me know if this is not proper.
So my aim is to create a script that I can eventually turn into a Perl/Tk GUI app that allows some automatic correction of Search Engine Optimization (SEO) problems with websites. The script should look for missing titles, meta descriptions, meta keywords, and img tags with missing alt attributes.
For a missing title the program tries to use the content of the first h1, h2, h3, h4, or p tag it encounters. For the description it uses the Lingua::EN::Summarize module. For keywords it uses Lingua::EN::Keywords. For alt attributes it uses the image file name minus the path and extension. The dream eventually is to use an API for an online image-matching app like Google Goggles to get a text description of the image to put in the alt attribute... but that is for later. Right now the img source file name is extracted with File::Basename. Finally, it uses WWW::SimpleRobot to traverse websites and serve up URLs for processing.
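The file-name approach to alt text can be sketched in a few lines. This is a minimal sketch, not the script's own MAlt routine: the helper name alt_from_src and the example URL are made up for illustration, and it strips any single extension with a regex suffix pattern rather than a fixed list. Turning underscores and hyphens into spaces is an extra assumption about what reads well as alt text.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: derive alt text from the value of an img src attribute.
sub alt_from_src {
    my ($src) = @_;
    return '' unless defined $src && length $src;

    # Drop any query string or fragment before looking at the path.
    ( my $path = $src ) =~ s/[?#].*\z//s;

    # A regex suffix pattern strips one trailing extension, whatever its case.
    my ($name) = fileparse( $path, qr/\.[^.\/]*/ );

    # Assumption: separators read better as spaces in alt text.
    $name =~ tr/_-/ /;
    return $name;
}

print alt_from_src('http://example.com/images/red_apple.JPG'), "\n";
```

A pattern suffix avoids listing every extension in both upper and lower case, at the cost of also stripping extensions that are not image types.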
Thanks to Anonymous Monk, toolic and GrandFather so far. As I'm sure is apparent, I'm a neophyte not only as a Perl writer but as a programmer, period, if that.
-----Update-----
I'm updating the earlier code. I'm getting HTML output from the code below, but it is not spidering all the pages in a domain, maybe because of the regex in FOLLOW_REGEX, I don't know. Also, there is some strange output after the <body> tag. One line reads, for example:
" _contentheadtextHTML::Element=HASH(0x138563c) _contentheadtextHTML::Element=HASH(0x13a25fc)
I guess in reality HTML::Element=HASH(0x138563c) and HTML::Element=HASH(0x13a25fc) are the two meta tags I'm trying to insert into the head, so my problem still exists. I think this comes from a basic misunderstanding of HTML::Element, HTML::Tree and HTML::TreeBuilder, so I'm going back to study these on CPAN.
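The HTML::Element=HASH(0x...) text looks like what Perl produces when an object is interpolated into a double-quoted string, and push_content treats every plain string it receives as text to append. Below is a minimal sketch of inserting a real meta element into the head instead, assuming the page parses into a tree that has a head element. The helper name add_meta and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical helper: append <meta name=... content=...> to the tree's <head>.
sub add_meta {
    my ( $tree, $name, $content ) = @_;

    # push_content on the root would append to <html>, so locate <head> first.
    my $head = $tree->look_down( _tag => 'head' ) or return;

    # Pass the element object itself; quoting it would stringify the reference.
    my $meta = HTML::Element->new( 'meta', name => $name, content => $content );
    $head->push_content($meta);
    return $meta;
}

# Sample markup, made up for the demonstration.
my $html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>';
my $tree = HTML::TreeBuilder->new_from_content($html);

add_meta( $tree, 'description', 'A short demo page.' );

# push_content also accepts an array-ref description of a new element.
$tree->look_down( _tag => 'head' )
     ->push_content( [ 'meta', { name => 'keywords', content => 'demo, example' } ] );

print $tree->as_HTML( undef, '  ' ), "\n";
$tree->delete;
```

Calling delete on an element right after inserting it would detach it from the tree again, so the sketch deletes only the whole tree once the output has been printed.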
#!/usr/bin/perl
package Metabot;
use warnings;
use strict;
use WWW::SimpleRobot;
use HTML::Entities;
require HTML::Parser;
use Lingua::EN::Summarize;
use HTML::Summary;
use HTML::TreeBuilder;
use Lingua::EN::Keywords;
use HTML::Tree;
use LWP::Simple;

@Metabot::ISA = qw(HTML::Parser);

my $url = $ARGV[0];
my $parser = Metabot->new;
my $robot = WWW::SimpleRobot->new(
    URLS                 => [ $url ],
    FOLLOW_REGEX         => "^$url",
    DEPTH                => 2,
    TRAVERSAL            => 'depth',
    VISIT_CALLBACK       => \&Botulism,
    BROKEN_LINK_CALLBACK => \&Snicklefritz,
);
$robot->traverse;
my @urls = @{$robot->urls};
my @pages = @{$robot->pages};
for my $page ( @pages ) {
    my $url = $page->{url};
    my $depth = $page->{depth};
    my $modification_time = $page->{modification_time};
}

sub Botulism {
    my ( $url, $depth, $html, $links ) = @_;
    print "\nURL: $url - depth $depth\n";
    $html = decode_entities($html);
    $html =~ s/document\.write\(.+?\)\;//g;
    $html =~ s/\&\#.+?\;//g;
    my $tree = HTML::TreeBuilder->new();
    $tree->parse($html);
    no warnings 'uninitialized';
    eval {
        my $Title = substr $tree->look_down( '_tag', 'title' )->as_text , 0, 65;
        print "Title exists and is: $Title.\n";
    } or do {
        my $Title;
        for my $tag( qw' h1 h2 h3 h4 p ' ){
            last if eval {
                $Title = substr $tree->look_down( '_tag', $tag )->as_text , 0, 65;
                if( length $Title ){
                    $html->push_content($Title);
                    print "No title was found so the first $tag tag contents \n were written to the title field in the header.\n";
                }
            }
        }
        unless($Title){
            print "No title exists and no suitable \ntext was found by this bot to use as one.\n";
        }
    };
    my $filteredhtml = summarize( $html, filter => 'html' );
    my $summary = summarize( $filteredhtml, maxlength => 500 );
    $summary =~ s/\s+/ /gs;
    my $var = substr($summary, 0, 155);
    print "Using Lingua::EN::Summarize Summary: $var\n\n";
    local $\ = $/;
    my $newmetadescription = HTML::Element->new('meta', 'name' => 'description', 'content' => "$var");
    $tree->push_content("_content", "head", "text", "$newmetadescription");
    $newmetadescription = $newmetadescription->delete;
    my $title = substr $tree->look_down( '_tag', 'title' )->as_text , 0, 65;
    my @keywords = keywords($title.$summary);
    print "Keywords: " . join(", ", @keywords) . "\n\n";
    local $\ = $/;
    my $newmetakeywords = HTML::Element->new('meta', 'name' => 'keywords', 'content' => "@keywords");
    $tree->push_content("_content", "head", "text", "$newmetakeywords");
    $newmetakeywords = $newmetakeywords->delete;
    local $\ = $/;
    print $_->as_HTML for $tree->look_down( '_tag', 'img ', sub { not defined $_[0]->attr('alt') } );
    print '---';
    print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } );
    print '---';
    $_->attr( alt => MAlt($_) ) for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } );
    print $_->as_HTML for $tree->look_down(qw' _tag img ');
    print $tree->as_HTML;
    $tree = $tree->delete;
}

sub MAlt {
    my $imgscalar = $_[0];
    my $imgsrc = $imgscalar->attr('src');
    use File::Basename;
    my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPEG .PNG .BMP .PHP .ICO);
    my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist);
    '!' . $imgfilenopathnoext;
}

sub Snicklefritz {
    my ( $url, $linked_from, $depth ) = @_;
    print "The link $url from the page $linked_from at depth $depth\n appears to be broken. please repair the link manually\n";
}

sub Ebola {
    my( $html, $clip, $text ) = @_;
    if(defined $text and length $text ) {
        $text = substr $text, 0, $clip;
        $html->push_content( $text );
    }
}
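On the spidering gap: the script passes FOLLOW_REGEX => "^$url", so any regex metacharacter in the start URL (a "?" in a query string, for instance) is interpreted as pattern syntax rather than as a literal character. Whether that explains the missing pages here is only a guess, and DEPTH => 2 presumably caps the crawl as well. A minimal sketch of the quoting issue alone, using core Perl only; the helper name follow_regex_for and the example URLs are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern string that matches $base literally
# at the start of a link.
sub follow_regex_for {
    my ($base) = @_;
    return '^' . quotemeta($base);
}

# Example start URL, made up; the "?" is what trips the unquoted pattern.
my $base = 'http://example.com/shop?lang=en';

my $raw    = "^$base";                 # the form the script uses
my $quoted = follow_regex_for($base);  # metacharacters escaped

for my $link ( "$base&page=2", 'http://example.org/' ) {
    printf "%-42s raw: %-5s quoted: %s\n", $link,
        ( $link =~ /$raw/    ? 'match' : 'no' ),
        ( $link =~ /$quoted/ ? 'match' : 'no' );
}
```

If the start URL contains no metacharacters other than dots, the unquoted pattern over-matches slightly rather than under-matches, so the depth limit would be the next thing to check.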
Another question: what is sub Ebola supposed to be doing?
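Reading the sub as posted: it takes an object that has a push_content method (so an HTML::Element, despite the parameter being named $html), a maximum length, and some text, then appends the text cut to that length. The script never calls it. A minimal sketch of the same logic with the parameter renamed, assuming it was meant as a helper for filling in a title; the name append_clipped and the sample heading are hypothetical:

```perl
use strict;
use warnings;
use HTML::Element;

# Same logic as sub Ebola, with names that say what each argument must be.
sub append_clipped {
    my ( $element, $clip, $text ) = @_;

    # Nothing to add when the text is missing or empty.
    return unless defined $text and length $text;

    # Keep at most $clip characters and append them as a text node.
    $element->push_content( substr $text, 0, $clip );
    return $element;
}

my $title = HTML::Element->new('title');
append_clipped( $title, 10, 'A heading that is far too long for a title' );
print $title->as_HTML, "\n";
```

Passing the raw HTML string as the first argument would fail, because a plain string has no push_content method.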
Re: SEO Fixer Part II - Updated
by Anonymous Monk on Apr 06, 2011 at 06:50 UTC
by socrtwo (Sexton) on Apr 06, 2011 at 15:06 UTC
by socrtwo (Sexton) on Apr 06, 2011 at 16:03 UTC
by Anonymous Monk on Apr 06, 2011 at 20:24 UTC