PerlMonks  

To Anonymous: my apologies once again. I didn't mean to offend your sense of the effort I should be putting in, and I really am grateful for the help you are giving me. I love Perl when I can figure it out, but I have difficulty learning things in a classroom setting. Additionally, a lot of CPAN is impenetrable and discouraging to me. I know I'm abusing Perl Monks and other sources by hacking my way through this, but believe me, I'm learning a lot, even though what I don't know is often shocking, I'm sure. I have hacked my way through things successfully in the past; you can see that by looking up my record here. I have also placed a bunch of hacked Perl things on SourceForge. Maybe my motivations are vainglorious and I am abusing Perl Monks to get there, but I also love each new bit I learn about Perl. It's a very powerful language and, despite my trouble, maybe the next easiest after HTML.

I remember now a reason I may have gone back to the original: I also received an error about the $FOLLOW scalar not being declared. I declared it with "my $FOLLOW;" and got things working again, rewriting the code to fit your skeleton as recommended. I didn't understand what $FOLLOW was supposed to be doing. I suspect that the regex is there to tell the spider not to try to evaluate PDF files and the like, but only pages ending in htm and html. I don't know how $FOLLOW magically works. I'm certainly willing to write in a regex that only looks for htm, html, php and asp pages if $FOLLOW needs that definition.
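A minimal sketch of the kind of pattern I have in mind, using only core Perl. The name should_follow and the example URLs are made up for illustration, and I am assuming (not having confirmed it in the WWW::SimpleRobot source) that FOLLOW_REGEX is simply matched against each link URL the robot finds:

```perl
use strict;
use warnings;

# Hypothetical pattern: accept only URLs ending in .htm, .html, .php or .asp,
# optionally followed by a query string or a fragment.
my $FOLLOW = qr{\.(?:html?|php|asp)(?:[?#].*)?$}i;

# Hypothetical helper, only here so the pattern can be tried outside the robot.
sub should_follow {
    my ($url) = @_;
    return $url =~ $FOLLOW ? 1 : 0;
}

for my $url (
    'http://example.com/index.html',
    'http://example.com/about.htm',
    'http://example.com/form.php?id=3',
    'http://example.com/report.pdf',
    'http://example.com/logo.png',
) {
    printf "%-36s %s\n", $url, should_follow($url) ? 'follow' : 'skip';
}
```

One thing to watch with a pattern like this: links to a bare directory such as http://example.com/docs/ have no extension at all, so they would be skipped too.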

As for Ebola, I now suspect you mean I should rewrite it as a routine for the img alt attribute substitution. I'm not sure exactly how to do that. At any rate, the code for the images now works in the body of the Botulism sub (I get the joke about it being an infected bot...). However, the code for the meta tags does not. I don't understand why substitutions are being made in the $tree for title and img, but the new meta tags are not being spliced in...
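As far as I can tell, the difference is that attr() changes an element that already lives in the tree, whereas HTML::Element->new only creates a free-standing element; assigning it to $_ and then looping with "for" just re-aliases $_ to the head element, so the new meta is never attached anywhere. A minimal sketch of attaching it with push_content, where add_meta and the sample page are hypothetical names of mine and not part of the skeleton:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: build a <meta> element and append it to <head>.
sub add_meta {
    my ( $tree, $name, $content ) = @_;
    my $head = $tree->look_down( _tag => 'head' ) or return;
    my $meta = HTML::Element->new( 'meta', name => $name, content => $content );
    $head->push_content($meta);    # this is the step that puts it into the tree
    return $meta;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'
);
add_meta( $tree, description => 'A short summary of the page' );
add_meta( $tree, keywords    => 'perl, spider, seo' );

print $tree->as_HTML( undef, '  ' ), "\n";
$tree->delete;
```

A page that already has a description or keywords meta would end up with two of them under this sketch, so a real version would probably want to look for an existing one first.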

You are right: your code does spider properly once I declared $FOLLOW with "my". The code below produces results and now spiders correctly; it just doesn't inject the new meta tags.

#!/usr/bin/perl --
use warnings;
use strict;
use WWW::SimpleRobot;
use HTML::Entities;
require HTML::Parser;
use Lingua::EN::Summarize;
use HTML::TreeBuilder;
use Lingua::EN::Keywords;
use HTML::Tree;
use LWP::Simple;

Main( @ARGV );
exit( 0 );

sub Main {
    my @urls = @_; # or hardcode them here
    my $FOLLOW;
    my $robot = WWW::SimpleRobot->new(
        URLS                 => \@urls,
        FOLLOW_REGEX         => $FOLLOW,
        DEPTH                => 2,
        TRAVERSAL            => 'depth',
        VISIT_CALLBACK       => \&Botulism,
        BROKEN_LINK_CALLBACK => \&Snicklefritz,
    );
    eval { $robot->traverse; 1 } or warn "robot died, but we caught it: $@ ";
}

sub MAlt {
    my $imgscalar = $_[0];
    my $imgsrc = $imgscalar->attr('src');
    use File::Basename;
    my @suffixlist = qw(.gif .jpg .jpeg .png .bmp .php .ico .GIF .JPG .JPEG .PNG .BMP .PHP .ICO);
    my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist);
    '!' . $imgfilenopathnoext;
}

sub Botulism {
    my ( $url, $depth, $html, $links ) = @_;
    print "\nURL: $url - depth $depth\n";
    $html = decode_entities($html);
    $html =~ s/document\.write\(.+?\)\;//g;
    $html =~ s/\&\#.+?\;//g;
    my $tree = HTML::TreeBuilder->new();
    $tree->parse($html);
    no warnings 'uninitialized';
    eval {
        my $Title = substr $tree->look_down( '_tag', 'title' )->as_text , 0, 65;
        print "Title exists and is: $Title.\n";
    } or do {
        my $Title;
        for my $tag( qw' h1 h2 h3 h4 p ' ){
            last if eval {
                $Title = substr $tree->look_down( '_tag', $tag )->as_text , 0, 65;
                if( length $Title ){
                    $html->push_content($Title);
                    print "No title was found so the first $tag tag contents \n were written to the title field in the header.\n";
                }
            }
        }
        unless($Title){
            print "No title exists and no suitable \ntext was found by this bot to use as one.\n";
        }
    };
    my $filteredhtml = summarize( $html, filter => 'html' );
    my $summary = summarize( $filteredhtml, maxlength => 500 );
    $summary =~ s/\s+/ /gs;
    my $var = substr($summary, 0, 155);
    print "Using Lingua::EN::Summarize Summary: $var\n\n";
    local $\ = $/;
    $_= HTML::Element->new('meta', 'content' => "$var", 'name' => 'description');
    print $_->as_HTML for $tree->look_down(qw' _tag head ');
    my $title = substr $tree->look_down( '_tag', 'title' )->as_text , 0, 65;
    my @keywords = keywords($title.$summary);
    print "Keywords: " . join(", ", @keywords) . "\n\n";
    local $\ = $/;
    $_= HTML::Element->new('meta', 'content' => "@keywords", 'name' => 'keywords');
    print $_->as_HTML for $tree->look_down(qw' _content head ');
    local $\ = $/;
    print $_->as_HTML for $tree->look_down(
        '_tag', 'img ',
        sub { not defined $_[0]->attr('alt') }
    );
    print '---';
    print $_->as_HTML for $tree->look_down(
        qw' _tag img ',
        sub { not length $_[0]->attr('alt') }
    );
    print '---';
    $_->attr( alt => MAlt($_) ) for $tree->look_down(
        qw' _tag img ',
        sub { not length $_[0]->attr('alt') }
    );
    print $_->as_HTML for $tree->look_down(qw' _tag img ');
    print $tree->as_HTML;
    $tree = $tree->delete;
}

sub Snicklefritz {
    my ( $url, $linked_from, $depth ) = @_;
    print "The link $url from the page $linked_from at depth $depth\n appears to be broken. please repair the link manually\n";
}

sub Ebola {
    my( $html, $clip, $text ) = @_;
    if(defined $text and length $text ) {
        $text = substr $text, 0, $clip;
        $html->push_content( $text );
    }
}
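The filename-to-alt-text step in MAlt above can also be tried on its own, outside the spider. This is only a sketch with a made-up name (alt_from_src) and made-up sample paths; unlike MAlt it takes the src string directly, strips any final extension with a pattern rather than a fixed suffix list, and does not add the leading '!':

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: derive fallback alt text from an image src value
# by dropping the directory part and the file extension.
sub alt_from_src {
    my ($src) = @_;
    return '' unless defined $src && length $src;
    $src =~ s/[?#].*$//;                            # ignore query string or fragment
    my ($name) = fileparse( $src, qr/\.[^.]*$/ );   # filename without its last extension
    $name =~ tr/-_/  /;                             # hyphens and underscores become spaces
    return $name;
}

for my $src ( '/images/site_logo.PNG', 'photos/team-photo.jpeg?v=2' ) {
    print "$src => ", alt_from_src($src), "\n";
}
```

Inside the spider this would be called with $img->attr('src') for each img element that has no alt text.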

In reply to Re^2: SEO Fixer Part II - Updated by socrtwo
in thread SEO Fixer Part II - Updated by socrtwo
