comment on

Hope everyone doesn't mind but I'm starting a new node to what I think is now fine tuning comparatively speaking to where I at started. Let me know if this is not proper.

So my aim is to create a script which I can eventually turn into a Perl⁄TK GUI app that allows some automatic correction of Search Engine Optimization (SEO) problems with websites. The script should look for missing titles, meta descriptions, meta keywords, and img tags with missing alt attributes.

For the missing titles the program tries to use the first h1, h2, h3, h4, or p tag content it encounters. For description it uses the Lingua::EN::Summarize module. For keywords it uses Lingua::EN::Keywords. For alt attributes it uses the file name minus the full path and extension elements. The dream eventually is to use an API for image online image matching app like Google Goggles to get a fix on a text description of the identity of the img to put in the alt attribute...but that is for later. Right now the img source file name is extracted with File::Basename. Finally it is using Simple::Robot to traverse websites and serve up URLs fpr processing.

Thanks to Anonymous Monk, toolic and GrandFather so far. As I'm sure is apparent I'm not only a neophyte acolyte Perl writer but programmer period, if that.

-----Update-----

I'm updating earlier code. I'm getting HTML output from below, but it is not spidering all the pages in a domain, maybe because of the REGEX statement, I don't know. Also there is some strange array output after the <body> tag. One reads for example:

" _contentheadtextHTML::Element=HASH(0x138563c) _contentheadtextHTML::Element=HASH(0x13a25fc)

I guess in reality the HTML::Element=HASH(0x138563c) and HTML::Element=HASH(0x13a25fc) are the the two meta tags I'm trying to insert into the header, so my problem still exists. I think this is a basic misunderstanding of HTML::Element, HTML::Tree and HTML::TreeBuilder so I'm going back to study these in CPAN.


#!/usr/bin/perl
package Metabot;

use warnings;
use strict;
use WWW::SimpleRobot;
use HTML::Entities;
require HTML::Parser;
use Lingua::EN::Summarize;
use HTML::Summary;
use HTML::TreeBuilder;
use Lingua::EN::Keywords; 
use HTML::Tree;
use LWP::Simple;

@Metabot::ISA = qw(HTML::Parser);

my $url = $ARGV[0];

my $parser = Metabot->new;

my $robot = WWW::SimpleRobot->new(
    URLS            => [ $url ],
    FOLLOW_REGEX    => "^$url",
    DEPTH           => 2,
    TRAVERSAL       => 'depth',
    VISIT_CALLBACK  => \&Botulism,
    BROKEN_LINK_CALLBACK  => \&Snicklefritz,
);

$robot->traverse;

my @urls = @{$robot->urls};

my @pages = @{$robot->pages};

for my $page ( @pages )    {
    my $url = $page->{url};
    my $depth = $page->{depth};
    my $modification_time = $page->{modification_time};
}

sub Botulism {
    my ( $url, $depth, $html, $links ) = @_;
    print "\nURL: $url - depth $depth\n"; 
            $html = decode_entities($html);
            $html =~ s/document\.write\(.+?\)\;//g;
            $html =~ s/\&amp;\#.+?\;//g;
            my $tree = HTML::TreeBuilder->new();
            $tree->parse($html);
    
      no warnings 'uninitialized';
    
    eval {
        my $Title = substr $tree->look_down( '_tag', 'title' )->as_tex
+t , 0, 65;
        print "Title exists and is: $Title.\n";
    } or do {
        my $Title;
        for my $tag( qw' h1 h2 h3 h4 p ' ){
            last if eval {
                $Title = substr $tree->look_down( '_tag', $tag )->as_t
+ext , 0, 65;
                if( length $Title ){
                    $html->push_content($Title);
                    print "No title was found so the first $tag tag co
+ntents \n
                    were written to the title field in the header.\n";
                }
            }
        }
        unless($Title){
            print "No title exists and no suitable \ntext 
            was found by this bot to use as one.\n";
        }
    };
            my $filteredhtml = summarize( $html, filter => 'html' );
            my $summary = summarize( $filteredhtml, maxlength => 500 )
+;
            $summary =~ s/\s+/ /gs;
            my $var = substr($summary, 0, 155);
            print "Using Lingua::EN::Summarize Summary: $var\n\n"; 
            
            local $\ = $/;
            my $newmetadescription = HTML::Element->new('meta', 'name'
+ => 'description', 'content' => "$var");
            $tree->push_content("_content", "head", "text", "$newmetad
+escription");
            $newmetadescription = $newmetadescription->delete;
            
            my $title = substr $tree->look_down( '_tag', 'title' )->as
+_text , 0, 65;
            my @keywords = keywords($title.$summary);
            print "Keywords: " . join(", ", @keywords) . "\n\n";
            
            local $\ = $/;
            my $newmetakeywords = HTML::Element->new('meta', 'name' =>
+ 'keywords', 'content' => "@keywords");
            $tree->push_content("_content", "head", "text", "$newmetak
+eywords");
            $newmetakeywords = $newmetakeywords->delete;
            
            
            local $\ = $/;
            print $_->as_HTML
              for $tree->look_down( '_tag', 'img ',
                sub { not defined $_[0]->attr('alt') } );

            print '---';

            print $_->as_HTML
              for $tree->look_down( qw' _tag img ',
                sub { not length $_[0]->attr('alt') } );

            print '---';

            $_->attr( alt => MAlt($_) )
              for $tree->look_down( qw' _tag img ',
                sub { not length $_[0]->attr('alt') } );
            print $_->as_HTML for $tree->look_down(qw' _tag img ');
            
            print $tree->as_HTML;
            $tree = $tree->delete;
}

sub MAlt {
my $imgscalar = $_[0];
my $imgsrc = $imgscalar->attr('src');
use File::Basename;
my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPE
+G .PNG .BMP .PHP .ICO);
my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist);
'!' . $imgfilenopathnoext;
}

sub Snicklefritz {
    my ( $url, $linked_from, $depth ) = @_;
    print "The link $url from the page $linked_from at depth $depth\n
    appears to be broken.  please repair the link manually\n";
}

sub Ebola {
    my( $html, $clip, $text ) = @_;
    if(defined $text and length $text ) {
        $text = substr $text, 0, $clip;
        $html->push_content( $text );
    }
}
}
[download]

Another question what is sub Ebola supposed to be doing?

In reply to SEO Fixer Part II - Updated by socrtwo

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.