Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Trying to Insert Alt Tags Programmatically

by socrtwo (Sexton)
on Apr 02, 2011 at 13:27 UTC ( [id://897084]=perlquestion: print w/replies, xml ) Need Help??

socrtwo has asked for the wisdom of the Perl Monks concerning the following question:

Although it may be unwise, I'm trying to insert the image names as image alt attributes in web pages. I'm using HTML::Element, but can't seem to get it quite right. I think I know how to create a new tag but not how to creat a new attribute within say all img tags that don't have alt attributes.

Here's what I have:

$tree = HTML::TreeBuilder->new_from_content($html); foreach my $img ($tree->look_down('_tag', 'img')) { my $alatt = $img->attr('alt') if (!$alatt) { $imgsrcattribute = $img->attr('src'); use File::Basename; my @suffixlist = qw(gif, jpg, jpeg, png, bmp, php, ico, GIF, J +PG, JPEG, PNG, BMP, PHP, ICO); my $imgfilenopathnoext = fileparse($imgsrcattribute,@suffixlis +t); my $newalttag = HTML::Element->new('img', alt=>""); $newalttag->attr('alt', "$imgfilenopathnoext"); $html->push_content($imgfilenopathnoext); $imgfilenopathnoext = $imgfilenopathnoext->delete; print "Alt tag not found for $imgsrcattribute \nInserted src f +ile name $imgfilenopathnoext for alt attribute"; } }

So how do I find all the images that don't have alt tags and insert the image file name (not the full path or extension) as the alt attribute?

I thought of another possibility of using the nearest non-html content text as the alt tag "suggestion". If anybody has any ideas for other ways to automatically create the alt tag contents I'd be happy to hear.

The dream way to make alt tags automatically (or at least have a temporary space filler) would be to hook into a semantic image search API and have the image identified some way. For instance possibly using this service: http://www.revimg.net/add.php. However for this to work, with a Google Goggles type program, the service would have to make correct identity guesses pretty accurately.

Replies are listed 'Best First'.
Re: Trying to Insert Alt Tags Programmatically
by toolic (Bishop) on Apr 02, 2011 at 15:52 UTC
    Probably unrelated to your problem, but do you really want commas? If not, get rid of them:
    use warnings; use strict; my @suffixlist = qw(gif, jpg, jpeg, png, bmp, php, ico, GIF, JPG, JPEG +, PNG, BMP, PHP, ICO); print "$suffixlist[0]\n"; __END__ Possible attempt to separate words with commas at gif,
    use strict and warnings
      Thanks for the heads up :-).

        Hi. Thanks to the anonymous monk and toolic. I'm a little further on. Below is my full code now. What I want the Perl to do is spider a local folder recursively, maybe 2 level deep and then write titles, meta descriptions, meta keywords and alt attributes for images if they are missing.

        My biggest issue might be that the Perl in the script only accepts http paths as an argument, not any configured local path as far as I can tell. So the following don't work:

        • file:///C:/Users/socrtwo/Desktop/wgts/index.html
        • C:/Users/socrtwo/Desktop/wgts/index.html
        • C:\Users\socrtwo\Desktop\wgts\index.html
        • index.html

        This poses an issue because I don't want to even try to update the files online via FTP, I just want to do it in a local folder. So how do I change the spider to accept a local file name? Thanks in advance.

        The nice initial spider and description + keyword suggester comes courtesy of Troy (davistv): http://www.perlmonks.org/index.pl?node_id=267758

        #!/usr/bin/perl package Metabot; use warnings; use strict; use WWW::SimpleRobot; use HTML::Entities; require HTML::Parser; use Lingua::EN::Summarize; use HTML::Summary; use HTML::TreeBuilder; use Lingua::EN::Keywords; use HTML::Tree; use LWP::Simple; @Metabot::ISA = qw(HTML::Parser); my $url = $ARGV[0]; my $parser = Metabot->new; my $robot = WWW::SimpleRobot->new( URLS => [ $url ], FOLLOW_REGEX => "^$url", DEPTH => 2, TRAVERSAL => 'depth', VISIT_CALLBACK => sub { my ( $url, $depth, $html, $links) = @_; print "\nURL: $url - depth $depth\n"; $html = decode_entities($html); $html =~ s/document\.write\(.+?\)\;//g; $html =~ s/\&amp;\#.+?\;//g; my ($tree, $title, $titleastext, $newtitle, $newtitleh1, $ +newtitleastexth1, $newtitleastexth1clipped, $newtitleh2, $newtitleast +exth2, $newtitleastexth2clipped, $newtitleh3, $newtitleastexth3, $new +titleastexth3clipped, $newtitleh4, $newtitleastexth4, $newtitleastext +h4clipped, $newtitlep, $newtitleastextp, $newtitleastextpclipped, $su +mmary, $var, $newmetadescription, $newmetakeywords); $tree = HTML::Tree->new(); $tree->parse($html); $title = $tree->look_down( '_tag' , 'title' ); $titleastext = $title->as_text; use HTML::Element; if ($titleastext){ print "\nTitle: $titleastext\n\n"; } else { $newtitle = HTML::Element->new('title'); $newtitle = $newtitleh1; $newtitleh1 = $tree->look_down( '_tag' , 'h1' ); if ($newtitleh1){ $newtitleastexth1 = $newtitleh1->as_text; } } if ($newtitleastexth1){ $newtitleastexth1clipped = substr($newtitleastexth1, 0, 65 +); $html->push_content($newtitleastexth1clipped); print "\n$url does not have a title. We created one from\n + the first 66 characters your first headline tag \<h1\>:\n $newtitleastexth1clipped.\n Please change if desired.\n\n"} else { $newtitleh2 = $tree->look_down( '_tag' , 'h2' ); if ($newtitleh2){ $newtitleastexth2 = $newtitleh2->as_text; } } if ($newtitleastexth2){ $newtitleastexth2clipped = substr($newtitleastexth2, 0, 65 +); $html->push_content($newtitleastexth2clipped); print "\n$url does not have a title. We created one from\n + the first 66 characters your first headline tag \<h2\>:\n $newtitleastexth2clipped.\n Please change if desired.\n\n" } else { $newtitleh3 = $tree->look_down( '_tag' , 'h3' ); if ($newtitleh3){ $newtitleastexth3 = $newtitleh3->as_text; } } if ($newtitleastexth3){ $newtitleastexth3clipped = substr($newtitleastexth3, 0, 65 +); $html->push_content($newtitleastexth3clipped); print "\n$url does not have a title. We created one from\n + the first 66 characters your first headline tag \<h3\>:\n $newtitleastexth3clipped.\n Please change if desired.\n\n" } else { $newtitleh4 = $tree->look_down( '_tag' , 'h4' ); if ($newtitleh4){ $newtitleastexth4 = $newtitleh4->as_text; } } if ($newtitleastexth4){ $newtitleastexth4clipped = substr($newtitleastexth4, 0, 65 +); $html->push_content($newtitleastexth4clipped); print "\n$url does not have a title. We created one from\n + the first 66 characters your first headline tag \<h4\>:\n $newtitleastexth4clipped.\n Please change if desired.\n\n" } else { $newtitlep = $tree->look_down( '_tag' , 'p' ); if ($newtitlep){ $newtitleastexth3 = $newtitlep->as_text; } } if ($newtitleastextp){ $newtitleastextpclipped = substr($newtitleastextp, 0, 65); $html->push_content($newtitleastextpclipped); print "\n$url does not have a title. We created one from\n + the first 66 characters your first paragraph tag \<p\>:\n $newtitleastexth1clipped.\n Please change if desired.\n\n" } else { print "$url does not have a title and we are unable to sug +gest any."; } $tree = new HTML::TreeBuilder; $tree->parse($html); $summary = summarize( $html, filter => 'html', maxlength = +> 500 ); $summary =~ s/\s+/ /gs; $var = substr($summary, 0, 155); print "Using Lingua::EN::Summarize Summary: $var\n\n"; $newmetadescription = HTML::Element->new('meta', href => " +$html"); $newmetadescription->attr('name', 'description'); $newmetadescription->attr('content', "$var"); $html->push_content($newmetadescription); $newmetadescription = $newmetadescription->delete; my @keywords = keywords($title.$summary); print "Keywords: " . join(", ", @keywords) . "\n\n"; $newmetakeywords = HTML::Element->new('meta', href => "$ht +ml"); $newmetakeywords->attr('name', 'keywords'); $newmetakeywords->attr('content', "@keywords"); $html->push_content($newmetakeywords); $newmetakeywords = $newmetakeywords->delete; # $tree = HTML::TreeBuilder->new_from_content($html); # foreach my $img ($tree->look_down('_tag', 'img')) { # my $alatt = $img->attr('alt') # if (!$alatt) { # $imgsrcattribute = $img->attr('src'); # use File::Basename; # my @suffixlist = qw(gif jpg jpeg png bmp php ico GIF JPG JPE +G PNG BMP PHP ICO); # my $imgfilenopathnoext = fileparse($imgsrcattribute,@suffixl +ist); # my $newalttag = HTML::Element->new('img', alt=>""); # $newalttag->attr('alt', "$imgfilenopathnoext"); # $html->push_content($imgfilenopathnoext); # $imgfilenopathnoext = $imgfilenopathnoext->delete; # print "Alt tag not found for $imgsrcattribute \nInserted src + file name $imgfilenopathnoext for alt attribute"; # } # } Second(@ARGV); exit(0); sub Second { my ( $url, $depth, $html, $links) = @_; print "\nURL: $url - depth $depth\n"; $html = decode_entities($html); $html =~ s/document\.write\(.+?\)\;//g; $html =~ s/\&amp;\#.+?\;//g; my $tree = HTML::TreeBuilder->new(); $tree->parse($html); local $\ = $/; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not defined $_[0]->attr('alt') } ); print '---'; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print '---'; $_->attr( alt => MAlt($_) ) for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print $_->as_HTML for $tree->look_down(qw' _tag img '); } ## end sub Main sub MAlt { my $imgscalar = $_[0]; my $imgsrc = $imgscalar->attr('src'); use File::Basename; my @suffixlist = qw(gif jpg jpeg png bmp php ico GIF JPG JPEG PNG BMP +PHP ICO); my $imgfilenopathnoext = fileparse($imgsrc,@suffixlist); '!' . $imgfilenopathnoext; } } , BROKEN_LINK_CALLBACK => sub { my ( $url, $linked_from, $depth ) = @_; print STDERR "$url looks like a broken link on $linked_fro +m\n"; print STDERR "Depth = $depth\n"; } ); $robot->traverse; my @urls = @{$robot->urls}; my @pages = @{$robot->pages}; for my $page ( @pages ) { my $url = $page->{url}; my $depth = $page->{depth}; my $modification_time = $page->{modification_time}; } sub text { my ($self,$text) = @_; $self->{TEXT} .= $text; }

        The current error when crawling http://s2services.com is:

        Can't call method "push_content" without a package or object reference + at seo_f xer.pl line 126.

        The error for trying to crawl local files is similar to: C:%5CUsers%5Csocrtwo%5CDesktop%5Cwgts%5Cindex.html is not a web page

        or

        index.html is not a valid URL
Re: Trying to Insert Alt Tags Programmatically
by Anonymous Monk on Apr 02, 2011 at 15:05 UTC
    Here's what I have:

    Nice fragment :) here is a complete program

    #!/usr/bin/perl -- use strict; use warnings; use HTML::TreeBuilder; Main(@ARGV); exit(0); sub Main { my $html = '<html><body> <img src="0" alt="0"> <img src="1" alt="1"> <img src="2" alt="2"> <img src="3" alt=""> <img src="4"> <img src="5"> </body></html>'; my $tree = HTML::TreeBuilder->new(); $tree->parse($html); local $\ = $/; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not defined $_[0]->attr('alt') } ); print '---'; print $_->as_HTML for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print '---'; $_->attr( alt => MAlt($_) ) for $tree->look_down( qw' _tag img ', sub { not length $_[0]->attr('alt') } ); print $_->as_HTML for $tree->look_down(qw' _tag img '); } ## end sub Main sub MAlt { '!' . $_[0]->attr('src') } __END__ <img src="4" /> <img src="5" /> --- <img alt="" src="3" /> <img src="4" /> <img src="5" /> --- <img alt="0" src="0" /> <img alt="1" src="1" /> <img alt="2" src="2" /> <img alt="!3" src="3" /> <img alt="!4" src="4" /> <img alt="!5" src="5" />
    Alternatively, try HTML::TreeBuilder::XPath
    #!/usr/bin/perl -- use strict; use warnings; use HTML::TreeBuilder::XPath; Main(@ARGV); exit(0); sub Main { my $html = '<html><body> <img src="0" alt="0"> <img src="1" alt="1"> <img src="2" alt="2"> <img src="3" alt=""> <img src="4"> <img src="5"> </body></html>'; my $tree = HTML::TreeBuilder::XPath->new(); $tree->parse($html); local $\ = $/; print $_->as_HTML for $tree->findnodes('//img[not(@alt)]'); print '---'; print $_->as_HTML for $tree->findnodes('//img[not(string-length(@a +lt))]'); print '---'; $_->attr( alt => MAlt($_) ) for $tree->findnodes('//img[not(string-length(@alt))]'); print $_->as_HTML for $tree->findnodes('//img'); } ## end sub Main sub MAlt { '!' . $_[0]->attr('src') } __END__
      Wow, thanks. I will try them :-).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://897084]
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (4)
As of 2024-04-19 03:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found