I hope nobody minds, but I'm starting a new node for what I think is now fine-tuning compared to where I started. Let me know if this is not proper.
My aim is to create a script, which I can eventually turn into a Perl/Tk GUI app, that automatically corrects some Search Engine Optimization (SEO) problems with websites. The script should look for missing titles, meta descriptions, meta keywords, and img tags with missing alt attributes.
For missing titles the program tries to use the content of the first h1, h2, h3, h4, or p tag it encounters. For the description it uses the Lingua::EN::Summarize module. For keywords it uses Lingua::EN::Keywords. For alt attributes it uses the image file name minus the full path and extension. The eventual dream is to use the API of an online image matching app like Google Goggles to get a text description of the image to put in the alt attribute, but that is for later. Right now the img source file name is extracted with File::Basename. Finally, it uses WWW::SimpleRobot to traverse websites and serve up URLs for processing.
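The alt-attribute step described above can be sketched on its own, using only File::Basename. This is a minimal sketch, not the script's actual code: the helper name alt_from_src and the sample paths are made up for illustration, and stripping a query string before parsing is an extra assumption that the description above does not mention.

```perl
use strict;
use warnings;
use File::Basename;

# Hypothetical helper: derive alt text from an img src value by
# dropping the directory part and the extension.
sub alt_from_src {
    my ($src) = @_;
    return '' unless defined $src && length $src;

    # Assumption: ignore any query string or fragment after the file name.
    $src =~ s/[?#].*//s;

    # fileparse returns (name, directory, suffix); the pattern treats
    # everything from the last dot onwards as the suffix.
    my ($name) = fileparse( $src, qr/\.[^.]*/ );
    return $name;
}

# Sample paths are invented for this example.
print alt_from_src('/images/Red-Car.JPG'), "\n";
print alt_from_src('http://example.com/pics/logo.png?v=2'), "\n";
```

Using a suffix pattern instead of a fixed list of extensions avoids having to enumerate every upper and lower case variant, at the cost of also stripping extensions that are not image types.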
Thanks to Anonymous Monk, toolic and GrandFather so far. As I'm sure is apparent, I'm a neophyte not only as a Perl writer but as a programmer, period, if that.
-----Update-----
I'm updating earlier code. I'm getting HTML output from below, but it is not spidering all the pages in a domain, maybe because of the REGEX statement, I don't know. Also there is some strange array output after the <body> tag. One reads for example:
" _contentheadtextHTML::Element=HASH(0x138563c) _contentheadtextHTML::Element=HASH(0x13a25fc)
I guess that in reality HTML::Element=HASH(0x138563c) and HTML::Element=HASH(0x13a25fc) are the two meta tags I'm trying to insert into the header, so my problem still exists. I think this comes from a basic misunderstanding of HTML::Element, HTML::Tree and HTML::TreeBuilder, so I'm going back to study these on CPAN.
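The stray HTML::Element=HASH(...) text is what appears when an element object is interpolated into a string and that string is pushed as content. A minimal sketch of attaching the object itself to the head element might look like the following. The sample HTML and the description text are invented for the example, and it assumes the HTML::Tree distribution is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Invented sample page for illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>Hello</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my $meta = HTML::Element->new(
    'meta',
    name    => 'description',
    content => 'A short demo page',
);

# Quoting the object ("$meta") would push its string form,
# i.e. the HTML::Element=HASH(...) text, rather than a tag.
# Instead, find <head> and push the object itself.
my $head = $tree->look_down( _tag => 'head' );
$head->push_content($meta);

my $out = $tree->as_HTML;
print $out, "\n";

# Free the tree only once the output has been produced; deleting
# $meta earlier would also remove it from the tree.
$tree->delete;
```

Note that the element must not be deleted after it has been pushed, since it is now part of the tree that gets serialized.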
#!/usr/bin/perl
package Metabot;
use warnings;
use strict;
use WWW::SimpleRobot;
use HTML::Entities;
require HTML::Parser;
use Lingua::EN::Summarize;
use HTML::Summary;
use HTML::TreeBuilder;
use Lingua::EN::Keywords;
use HTML::Tree;
use LWP::Simple;
@Metabot::ISA = qw(HTML::Parser);
my $url = $ARGV[0];
my $parser = Metabot->new;
my $robot = WWW::SimpleRobot->new(
URLS => [ $url ],
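# \Q...\E quotes regex metacharacters such as "." and "?" in the start URL.
FOLLOW_REGEX => "^\Q$url\E",
# DEPTH caps how far from the start URL the robot follows links.
DEPTH => 2,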
TRAVERSAL => 'depth',
VISIT_CALLBACK => \&Botulism,
BROKEN_LINK_CALLBACK => \&Snicklefritz,
);
$robot->traverse;
my @urls = @{$robot->urls};
my @pages = @{$robot->pages};
for my $page ( @pages ) {
my $url = $page->{url};
my $depth = $page->{depth};
my $modification_time = $page->{modification_time};
}
sub Botulism {
my ( $url, $depth, $html, $links ) = @_;
print "\nURL: $url - depth $depth\n";
$html = decode_entities($html);
$html =~ s/document\.write\(.+?\)\;//g;
$html =~ s/\&\#.+?\;//g;
my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
# Signal end of input so the tree is completed before it is searched.
$tree->eof;
no warnings 'uninitialized';
eval {
my $Title = substr $tree->look_down( '_tag', 'title' )->as_text , 0, 65;
print "Title exists and is: $Title.\n";
} or do {
my $Title;
for my $tag( qw' h1 h2 h3 h4 p ' ){
last if eval {
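$Title = substr $tree->look_down( '_tag', $tag )->as_text , 0, 65;
if( length $Title ){
# $html is a plain string, so add a <title> element to <head> in the tree instead.
$tree->look_down( '_tag', 'head' )->push_content( [ 'title', $Title ] );
print "No title was found so the first $tag tag contents \n
were written to the title field in the header.\n";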
}
}
}
unless($Title){
print "No title exists and no suitable \ntext
was found by this bot to use as one.\n";
}
};
my $filteredhtml = summarize( $html, filter => 'html' );
my $summary = summarize( $filteredhtml, maxlength => 500 );
$summary =~ s/\s+/ /gs;
my $var = substr($summary, 0, 155);
print "Using Lingua::EN::Summarize Summary: $var\n\n";
local $\ = $/;
my $newmetadescription = HTML::Element->new('meta', 'name' => 'description', 'content' => $var);
# Push the element object itself into <head>. Quoting it pushes the text
# "HTML::Element=HASH(...)", and deleting it afterwards would remove it from the tree.
$tree->look_down( '_tag', 'head' )->push_content($newmetadescription);
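# Guard against pages that still have no <title> element.
my $title_element = $tree->look_down( '_tag', 'title' );
my $title = $title_element ? substr( $title_element->as_text, 0, 65 ) : '';
my @keywords = keywords("$title $summary");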
print "Keywords: " . join(", ", @keywords) . "\n\n";
local $\ = $/;
my $newmetakeywords = HTML::Element->new('meta', 'name' => 'keywords', 'content' => join(', ', @keywords));
# As with the description, push the object itself and do not delete it afterwards.
$tree->look_down( '_tag', 'head' )->push_content($newmetakeywords);
local $\ = $/;
print $_->as_HTML
for $tree->look_down( '_tag', 'img',
sub { not defined $_[0]->attr('alt') } );
print '---';
print $_->as_HTML
for $tree->look_down( qw' _tag img ',
sub { not length $_[0]->attr('alt') } );
print '---';
$_->attr( alt => MAlt($_) )
for $tree->look_down( qw' _tag img ',
sub { not length $_[0]->attr('alt') } );
print $_->as_HTML for $tree->look_down(qw' _tag img ');
print $tree->as_HTML;
$tree = $tree->delete;
}
sub MAlt {
my $imgscalar = $_[0];
my $imgsrc = $imgscalar->attr('src');
use File::Basename;
my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPEG .PNG .BMP .PHP .ICO);
my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist);
'!' . $imgfilenopathnoext;
}
sub Snicklefritz {
my ( $url, $linked_from, $depth ) = @_;
print "The link $url from the page $linked_from at depth $depth\n
appears to be broken. please repair the link manually\n";
}
sub Ebola {
my( $html, $clip, $text ) = @_;
if(defined $text and length $text ) {
$text = substr $text, 0, $clip;
$html->push_content( $text );
}
}
Another question: what is sub Ebola supposed to be doing?