Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Hope everyone doesn't mind but I'm starting a new node to what I think is now fine tuning comparatively speaking to where I at started. Let me know if this is not proper.

So my aim is to create a script which I can eventually turn into a Perl⁄TK GUI app that allows some automatic correction of Search Engine Optimization (SEO) problems with websites. The script should look for missing titles, meta descriptions, meta keywords, and img tags with missing alt attributes.

For the missing titles the program tries to use the first h1, h2, h3, h4, or p tag content it encounters. For description it uses the Lingua::EN::Summarize module. For keywords it uses Lingua::EN::Keywords. For alt attributes it uses the file name minus the full path and extension elements. The dream eventually is to use an API for image online image matching app like Google Goggles to get a fix on a text description of the identity of the img to put in the alt attribute...but that is for later. Right now the img source file name is extracted with File::Basename. Finally it is using Simple::Robot to traverse websites and serve up URLs fpr processing.

Thanks to Anonymous Monk, toolic and GrandFather so far. As I'm sure is apparent I'm not only a neophyte acolyte Perl writer but programmer period, if that.

-----Update-----

I'm updating earlier code. I'm getting HTML output from below, but it is not spidering all the pages in a domain, maybe because of the REGEX statement, I don't know. Also there is some strange array output after the <body> tag. One reads for example:

" _contentheadtextHTML::Element=HASH(0x138563c) _contentheadtextHTML::Element=HASH(0x13a25fc)

I guess in reality the HTML::Element=HASH(0x138563c) and HTML::Element=HASH(0x13a25fc) are the the two meta tags I'm trying to insert into the header, so my problem still exists. I think this is a basic misunderstanding of HTML::Element, HTML::Tree and HTML::TreeBuilder so I'm going back to study these in CPAN.

#!/usr/bin/perl package Metabot; use warnings; use strict; use WWW::SimpleRobot; use HTML::Entities; require HTML::Parser; use Lingua::EN::Summarize; use HTML::Summary; use HTML::TreeBuilder; use Lingua::EN::Keywords; use HTML::Tree; use LWP::Simple; @Metabot::ISA = qw(HTML::Parser); my $url = $ARGV[0]; my $parser = Metabot->new; my $robot = WWW::SimpleRobot->new( URLS => [ $url ], FOLLOW_REGEX => "^$url", DEPTH => 2, TRAVERSAL => 'depth', VISIT_CALLBACK => \&Botulism, BROKEN_LINK_CALLBACK => \&Snicklefritz, ); $robot->traverse; my @urls = @{$robot->urls}; my @pages = @{$robot->pages}; for my $page ( @pages ) { my $url = $page->{url}; my $depth = $page->{depth}; my $modification_time = $page->{modification_time}; } sub Botulism { my ( $url, $depth, $html, $links ) = @_; print "\nURL: $url - depth $depth\n"; $html = decode_entities($html); $html =~ s/document\.write\(.+?\)\;//g; $html =~ s/\&amp;\#.+?\;//g; my $tree = HTML::TreeBuilder->new(); $tree->parse($html); no warnings 'uninitialized'; eval { my $Title = substr $tree->look_down( '_tag', 'title' )->as_tex +t , 0, 65; print "Title exists and is: $Title.\n"; } or do { my $Title; for my $tag( qw' h1 h2 h3 h4 p ' ){ last if eval { $Title = substr $tree->look_down( '_tag', $tag )->as_t +ext , 0, 65; if( length $Title ){ $html->push_content($Title); print "No title was found so the first $tag tag co +ntents \n were written to the title field in the header.\n"; } } } unless($Title){ print "No title exists and no suitable \ntext was found by this bot to use as one.\n"; } }; my $filteredhtml = summarize( $html, filter => 'html' ); my $summary = summarize( $filteredhtml, maxlength => 500 ) +; $summary =~ s/\s+/ /gs; my $var = substr($summary, 0, 155); print "Using Lingua::EN::Summarize Summary: $var\n\n"; local $\ = $/; my $newmetadescription = HTML::Element->new('meta', 'name' + => 'description', 'content' => "$var"); $tree->push_content("_content", "head", "text", "$newmetad +escription"); $newmetadescription = $newmetadescription->delete; my $title = substr $tree->look_down( '_tag', 'title' )->as +_text , 0, 65; my @keywords = keywords($title.$summary); print "Keywords: " . join(", ", @keywords) . "\n\n"; local $\ = $/; my $newmetakeywords = HTML::Element->new('meta', 'name' => + 'keywords', 'content' => "@keywords"); $tree->push_content("_content", "head", "text", "$newmetak +eywords"); $newmetakeywords = $newmetakeywords->delete; local $\ = $/; print $_->as_HTML for $tree->look_down( '_tag', 'img ', sub { not defined $_[0]->attr('alt') } ); print '---'; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print '---'; $_->attr( alt => MAlt($_) ) for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print $_->as_HTML for $tree->look_down(qw' _tag img '); print $tree->as_HTML; $tree = $tree->delete; } sub MAlt { my $imgscalar = $_[0]; my $imgsrc = $imgscalar->attr('src'); use File::Basename; my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPE +G .PNG .BMP .PHP .ICO); my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist); '!' . $imgfilenopathnoext; } sub Snicklefritz { my ( $url, $linked_from, $depth ) = @_; print "The link $url from the page $linked_from at depth $depth\n appears to be broken. please repair the link manually\n"; } sub Ebola { my( $html, $clip, $text ) = @_; if(defined $text and length $text ) { $text = substr $text, 0, $clip; $html->push_content( $text ); } } }

Another question what is sub Ebola supposed to be doing?

.

In reply to SEO Fixer Part II - Updated by socrtwo

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (None)
    As of 2024-04-25 04:04 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      No recent polls found