SEO Fixer Part II

socrtwo has asked for the wisdom of the Perl Monks concerning the following question:

Hope everyone doesn't mind but I'm starting a new node to what I think is now fine tuning comparatively speaking to where I at started. Let me know if this is not proper.

So my aim is to create a script which I can eventually turn into a Perl⁄TK GUI app that allows some automatic correction of Search Engine Optimization (SEO) problems with websites. The script should look for missing titles, meta descriptions, meta keywords, and img tags with missing alt attributes.

For the missing titles the program tries to use the first h1, h2, h3, h4, or p tag content it encounters. For description it uses the Lingua::EN::Summarize module. For keywords it uses Lingua::EN::Keywords. For alt attributes it uses the file name minus the full path and extension elements. The dream eventually is to use an API for image online image matching app like Google Goggles to get a fix on a text description of the identity of the img to put in the alt attribute...but that is for later. Right now the img source file name is extracted with File::Basename. Finally it is using Simple::Robot to traverse websites and serve up URLs fpr processing.

Thanks to Anonymous Monk, toolic and GrandFather so far. As I'm sure is apparent I'm not only a neophyte acolyte Perl writer but programmer period, if that.

-----Update-----

I'm updating earlier code. I'm getting HTML output from below, but it is not spidering all the pages in a domain, maybe because of the REGEX statement, I don't know. Also there is some strange array output after the <body> tag. One reads for example:

" _contentheadtextHTML::Element=HASH(0x138563c) _contentheadtextHTML::Element=HASH(0x13a25fc)

I guess in reality the HTML::Element=HASH(0x138563c) and HTML::Element=HASH(0x13a25fc) are the the two meta tags I'm trying to insert into the header, so my problem still exists. I think this is a basic misunderstanding of HTML::Element, HTML::Tree and HTML::TreeBuilder so I'm going back to study these in CPAN.


#!/usr/bin/perl
package Metabot;

use warnings;
use strict;
use WWW::SimpleRobot;
use HTML::Entities;
require HTML::Parser;
use Lingua::EN::Summarize;
use HTML::Summary;
use HTML::TreeBuilder;
use Lingua::EN::Keywords; 
use HTML::Tree;
use LWP::Simple;

@Metabot::ISA = qw(HTML::Parser);

my $url = $ARGV[0];

my $parser = Metabot->new;

my $robot = WWW::SimpleRobot->new(
    URLS            => [ $url ],
    FOLLOW_REGEX    => "^$url",
    DEPTH           => 2,
    TRAVERSAL       => 'depth',
    VISIT_CALLBACK  => \&Botulism,
    BROKEN_LINK_CALLBACK  => \&Snicklefritz,
);

$robot->traverse;

my @urls = @{$robot->urls};

my @pages = @{$robot->pages};

for my $page ( @pages )    {
    my $url = $page->{url};
    my $depth = $page->{depth};
    my $modification_time = $page->{modification_time};
}

sub Botulism {
    my ( $url, $depth, $html, $links ) = @_;
    print "\nURL: $url - depth $depth\n"; 
            $html = decode_entities($html);
            $html =~ s/document\.write\(.+?\)\;//g;
            $html =~ s/\&amp;\#.+?\;//g;
            my $tree = HTML::TreeBuilder->new();
            $tree->parse($html);
    
      no warnings 'uninitialized';
    
    eval {
        my $Title = substr $tree->look_down( '_tag', 'title' )->as_tex
+t , 0, 65;
        print "Title exists and is: $Title.\n";
    } or do {
        my $Title;
        for my $tag( qw' h1 h2 h3 h4 p ' ){
            last if eval {
                $Title = substr $tree->look_down( '_tag', $tag )->as_t
+ext , 0, 65;
                if( length $Title ){
                    $html->push_content($Title);
                    print "No title was found so the first $tag tag co
+ntents \n
                    were written to the title field in the header.\n";
                }
            }
        }
        unless($Title){
            print "No title exists and no suitable \ntext 
            was found by this bot to use as one.\n";
        }
    };
            my $filteredhtml = summarize( $html, filter => 'html' );
            my $summary = summarize( $filteredhtml, maxlength => 500 )
+;
            $summary =~ s/\s+/ /gs;
            my $var = substr($summary, 0, 155);
            print "Using Lingua::EN::Summarize Summary: $var\n\n"; 
            
            local $\ = $/;
            my $newmetadescription = HTML::Element->new('meta', 'name'
+ => 'description', 'content' => "$var");
            $tree->push_content("_content", "head", "text", "$newmetad
+escription");
            $newmetadescription = $newmetadescription->delete;
            
            my $title = substr $tree->look_down( '_tag', 'title' )->as
+_text , 0, 65;
            my @keywords = keywords($title.$summary);
            print "Keywords: " . join(", ", @keywords) . "\n\n";
            
            local $\ = $/;
            my $newmetakeywords = HTML::Element->new('meta', 'name' =>
+ 'keywords', 'content' => "@keywords");
            $tree->push_content("_content", "head", "text", "$newmetak
+eywords");
            $newmetakeywords = $newmetakeywords->delete;
            
            
            local $\ = $/;
            print $_->as_HTML
              for $tree->look_down( '_tag', 'img ',
                sub { not defined $_[0]->attr('alt') } );

            print '---';

            print $_->as_HTML
              for $tree->look_down( qw' _tag img ',
                sub { not length $_[0]->attr('alt') } );

            print '---';

            $_->attr( alt => MAlt($_) )
              for $tree->look_down( qw' _tag img ',
                sub { not length $_[0]->attr('alt') } );
            print $_->as_HTML for $tree->look_down(qw' _tag img ');
            
            print $tree->as_HTML;
            $tree = $tree->delete;
}

sub MAlt {
my $imgscalar = $_[0];
my $imgsrc = $imgscalar->attr('src');
use File::Basename;
my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPE
+G .PNG .BMP .PHP .ICO);
my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist);
'!' . $imgfilenopathnoext;
}

sub Snicklefritz {
    my ( $url, $linked_from, $depth ) = @_;
    print "The link $url from the page $linked_from at depth $depth\n
    appears to be broken.  please repair the link manually\n";
}

sub Ebola {
    my( $html, $clip, $text ) = @_;
    if(defined $text and length $text ) {
        $text = substr $text, 0, $clip;
        $html->push_content( $text );
    }
}
}
[download]

Another question what is sub Ebola supposed to be doing?

Comment on SEO Fixer Part II - Updated Download Code

Replies are listed 'Best First'.
Re: SEO Fixer Part II - Updated by Anonymous Monk on Apr 06, 2011 at 06:50 UTC
I think this is a basic misunderstanding of ... Its much more basic than that, see Re^4: Trying to Insert Alt Tags Programmatically, your program, once again, looks nothing like that, you still got Metabot, and all the other things you copy/pasted, except now you have Ebola and Botulism too. `#!/usr/bin/perl -- use strict; use warnings; use ... all the other things that you need; Main( @ARGV ): exit( 0 );` [download] Another question what is sub Ebola supposed to be doing? Nothing if you don't know. I could try to explain how it was supposed to replace part of your code that kept repeating, how you're supposed to make sure it does what you need, and name it something descriptive, but honestly, whats the point, you're just going to ignore what i wrote, and copy/paste the code section	[reply] [d/l]
Re^2: SEO Fixer Part II - Updated by socrtwo (Sexton) on Apr 06, 2011 at 15:06 UTC
My profound apologies, but I reverted to another form of the program because it wasn't spidering the way it used to. I didn't mean to do things ignorantly. I tried to semi-intelligently keep things you had suggested like using the subroutines for VISIT_CALLBACK and BROKEN_LINK_CALLBACK which I now understand are what is to activated when respectively a legitimate and broken link are spidered.	[reply]
Re^2: SEO Fixer Part II - Updated by socrtwo (Sexton) on Apr 06, 2011 at 16:03 UTC
To anonymous, my apologies once again. I didn't mean to offend your sense of the effort I should be putting in and I really am grateful for the help you are giving me. I love Perl when I can figure it out, but have difficulty learning things in a classroom setting. Additionally a lot of CPAN is impenetrable and discouraging to me. I know I'm abusing Perl Monks and other source by hacking my way through this, but believe me, I'm learning a lot even though it's shocking often what I don't know, I'm sure. I have hacked my way through things successfully in the past you can see that by looking up my record here. I have a bunch of hacked Perl things I have placed on sourceforge too. Maybe my motivations are vainglorious, and am abusing Perl Monks to get there, but I also love each new bit I learn new about Perl. It's a very powerful language and despite my trouble maybe the next easiest after html. I remember now a reason I may have gone back to the original. I also received an error about the $FOLLOW scalar not being initiated. I just initiated it with a "my $FOLLOW;" and got things working again, rewriting your the code to fit your skeleton as recommended. I didn't understand what $FOLLOW was supposed to be doing. I suspect that the regex is there to tell the spider not to try to evaluate pdf files and the like but only those ending in htm and html. I don't how $FOLLOW magically works. I'm certainly willing to write in a regex that only looks for htm, html, php and asp pages if $FOLLOW needs that definition. As for Ebola, I suspecting now that you mean I should rewrite that as a routine for the img alt attribute substitution. I'm not sure exactly how to do that. At any rate, the code for the images now works in the body of the botulism sub (I get the joke about it being an infected bot...). However the code for the meta tags does not. I don't understand how substitutions are being made in the $tree for title and img but not splicing in of the new meta tags... You are right, your code does properly spider when I made $FOLLOW initiated with "my". The code below produces results, and now spiders correctly, it just doesn't inject the new meta tags. #!/usr/bin/perl -- use warnings; use strict; use WWW::SimpleRobot; use HTML::Entities; require HTML::Parser; use Lingua::EN::Summarize; use HTML::TreeBuilder; use Lingua::EN::Keywords; use HTML::Tree; use LWP::Simple; Main( @ARGV ); exit( 0 ); sub Main { my @urls = @_; # or hardcode them here my $FOLLOW; my $robot = WWW::SimpleRobot->new( URLS => \@urls, FOLLOW_REGEX => $FOLLOW, DEPTH => 2, TRAVERSAL => 'depth', VISIT_CALLBACK => \&Botulism, BROKEN_LINK_CALLBACK => \&Snicklefritz, ); eval { $robot->traverse; 1 } or warn "robot died, but we caught it +: $@ "; } sub MAlt { my $imgscalar = $_[0]; my $imgsrc = $imgscalar->attr('src'); use File::Basename; my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPE +G .PNG .BMP .PHP .ICO); my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist); '!' . $imgfilenopathnoext; } sub Botulism { my ( $url, $depth, $html, $links ) = @_; print "\nURL: $url - depth $depth\n"; $html = decode_entities($html); $html =~ s/document\.write$.+?$\;//g; $html =~ s/\&\#.+?\;//g; my $tree = HTML::TreeBuilder->new(); $tree->parse($html); no warnings 'uninitialized'; eval { my $Title = substr $tree->look_down( '_tag', 'title' )->as_tex +t , 0, 65; print "Title exists and is: $Title.\n"; } or do { my $Title; for my $tag( qw' h1 h2 h3 h4 p ' ){ last if eval { $Title = substr $tree->look_down( '_tag', $tag )->as_t +ext , 0, 65; if( length $Title ){ $html->push_content($Title); print "No title was found so the first $tag tag co +ntents \n were written to the title field in the header.\n"; } } } unless($Title){ print "No title exists and no suitable \ntext was found by this bot to use as one.\n"; } }; my $filteredhtml = summarize( $html, filter => 'html' ); my $summary = summarize( $filteredhtml, maxlength => 500 ) +; $summary =~ s/\s+/ /gs; my $var = substr($summary, 0, 155); print "Using Lingua::EN::Summarize Summary: $var\n\n"; local $\ = $/; $_= HTML::Element->new('meta', 'content' => "$var", 'name' + => 'description'); print $_->as_HTML for $tree->look_down(qw' _tag head '); my $title = substr $tree->look_down( '_tag', 'title' )->as +_text , 0, 65; my @keywords = keywords($title.$summary); print "Keywords: " . join(", ", @keywords) . "\n\n"; local $\ = $/; $_= HTML::Element->new('meta', 'content' => "@keywords", ' +name' => 'keywords'); print $_->as_HTML for $tree->look_down(qw' _content head ' +); local $\ = $/; print $_->as_HTML for $tree->look_down( '_tag', 'img ', sub { not defined $_[0]->attr('alt') } ); print '---'; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print '---'; $_->attr( alt => MAlt($_) ) for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print $_->as_HTML for $tree->look_down(qw' _tag img '); print $tree->as_HTML; $tree = $tree->delete; } sub Snicklefritz { my ( $url, $linked_from, $depth ) = @_; print "The link $url from the page $linked_from at depth $depth\n appears to be broken. please repair the link manually\n"; } sub Ebola { my( $html, $clip, $text ) = @_; if(defined $text and length $text ) { $text = substr $text, 0, $clip; $html->push_content( $text ); } } [download]	[reply] [d/l]
Re^3: SEO Fixer Part II - Updated by Anonymous Monk on Apr 06, 2011 at 20:24 UTC
As for Ebola, I suspecting now that you mean I should rewrite that as a routine for the img alt attribute substitution. I'll try to explain about Ebola, I'll try be clear. In Re^3: Trying to Insert Alt Tags Programmatically you posted this (I've run it through perltidy ) Read more... (2 kB) Every time you write repetitive stuff like that `$newtitleh1 ... $newtitleastexth1 ... $newtitleastexth1clipped ... $html->push_content ... ... $newtitleh2 ... $newtitleastexth2 ... $newtitleastexth2clipped ... $html->push_content ...` [download] you should turn it into a subroutine, which I did, and I called it Ebola (surely you noticed substr/push_content). To learn to Ebola you would write a program like this (Ebola.pl) #!/usr/bin/perl -- # Ebola.pl use strict; use warnings; use HTML::Tree; Main( @ARGV ); exit( 0 ); sub Main { my $t = HTML::TreeBuilder->new_from_content(join'','<h4>',0..9,'<h +4>'); my $f = HTML::TreeBuilder->new_from_content('<title>f</title>'); my $B = $f->look_down( qw' _tag body ' ); print $t->as_HTML, "\n\n"; print '-'x33, "\n\n"; if ( Ebola( $B, 5, eval{ $t->look_down(qw'_tag h1')->as_text } ) ) + { print "Using h1\n"; } elsif ( Ebola( $B, 5, eval{ $t->look_down(qw'_tag h2')->as_text +} ) ) { print "Using h2\n"; } elsif ( Ebola( $B, 5, eval{ $t->look_down(qw'_tag h3')->as_text +} ) ) { print "Using h3\n"; } elsif ( Ebola( $B, 5, eval{ $t->look_down(qw'_tag h4')->as_text +} ) ) { print "Using h4\n\n"; } print '-' x 33, "\n\n"; print $f->as_HTML, "\n\n"; } ## end sub Main sub Ebola { my ( $html, $clip, $text ) = @_; if ( defined $text and length $text ) { $text = substr $text, 0, $clip; $html->push_content($text); } } ## end sub Ebola __END__ [download] And the output of this program is `<html><head></head><body><h4>0123456789</h4><h4></h4></body></html> --------------------------------- Using h4 --------------------------------- <html><head><title>f</title></head><body>01234</body></html>` [download] This is how you learn to Ebola, by writing a small program (Ebola.pl) to explore how HTML::TreeBuilder/HTML::Element objects behave. For the next step in the process, you write a program called Ebola.t `sub Main { AsdfQwerty( '<h1>0123456789</h1>', '01234' ); AsdfQwerty( '<h1>0123456789</h2>', '01234' ); AsdfQwerty( '<h3>0123456789</h3>', '01234' ); AsdfQwerty( '<h4>0123456789</h4>', '01234' ); AsdfQwerty( '<h5>0123456789</h5>', '01234' ); AsdfQwerty( '<p>0123456789</p>', '01234' ); }` [download] Where AsdfQwerty(), like Ebola.pl, puts Ebola() through its paces. That is, AsdfQwerty(), tests Ebola(), by feeding Ebola() various inputs, and checking that the output Ebola() produces meets your expectations. Problems are easier to spot and fix in very small programs. Forget about your big task (SEO Fixer) for the moment, write 5 or 10 of these little programs, each dealing with a single small task, a single function (Ebola). But don't call it Ebola, Ebola isn't descriptive, it doesn't explain or even hint at what the subroutine is supposed to accomplish or demonstrate/teach you. Do you follow what I'm saying? The reason I chose names like Ebola is because you're supposed to change them. Learning to program is hard, there is a lot of information you have to juggle inside your head, and only sticks if you write/rewrite code yourself. You still have part of my demo prog in this post i'm replying to `local $\ = $/; print $_->as_HTML for $tree->look_down( '_tag', 'img ', sub { not defined $_[0]->attr('alt') } ); print '---'; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print '---'; $_->attr( alt => MAlt($_) ) for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print $_->as_HTML for $tree->look_down(qw' _tag img '); print $tree->as_HTML; $tree = $tree->delete;` [download] This code was supposed to demonstrate/teach how look_down works, and how to modify any tags you might find, it wasn't meant for direct inclusion in your program, after all, I'm not writing your program :) Its like learning to juggle First you start with one ball and then practice with two balls and two in one hand then three balls. Next you start with one bowling pin and then two bowling pins and then two bowling pins in one hand and then three bowling pins Next you start with one ball and one bowling pin and then one ball and one bowling pin in one hand and then two balls and one bowling pin Next you start with one two bowling pin and one ball And then you start with one knife and then two knives then two knives in one hand and then three knives Again you start with one ball and one knife then one ball and one knife in one hand then one knife and two balls Eventually you work your way up to three chainsaws, two bowling pins, two knives, and four tennis balls In this analogy, each juggling session is a new small program you've written. The more of them you write, the easier they get to write, the easier they get to combine, and if you forget how something works, you've got tons of little programs to show you. I know this must seem overwhelming (hey, I'm typing my fingers bloody here :D) so here is the last surprise, AsdfQwerty is better written using Test::More, like http://cpansearch.perl.org/src/PETEK/HTML-Tree-3.23/t/attributes.t Read more... (1022 Bytes) But it isn't necessary, its just motorized equipment, chainsaw versus handsaw. Good luck.	[reply] [d/l] [select]

Back to Seekers of Perl Wisdom