Yesterday, a new monk was asking around the chatterbox for a regexp that could match an HTML image tag that doesn't have an alt attribute. Sounds easy to a newbie, but everyone who's ever tried dealing with HTML using regexes knows it's not. The reasons are obvious: one has to deal with the possibility of > and < characters inside quoted attribute values, you don't know where a certain attribute is going to appear in a tag, etc. I'm no regex wizard, and even the people on the 'box who are just said to use HTML::*. I was one of those voices. But every time I came back, the same monk was repeating the same questions. I finally messaged that monk with a link to the following code.

#!/usr/bin/perl -w
#program to find img tags w/o alt attributes

use strict;
use HTML::TokeParser;

#build list of HTML files (.htm or .html) in the same directory
my @files=<*>;
@files = grep(/[.]html?$/i, @files);

#parse each file
for my $file (@files) {
    my $p = HTML::TokeParser->new( $file ) or die "Can't open $file: $!\n";
     
    #move through each html token in the file
    while (my $token = $p->get_token){
        #find IMG start tags (the parser lowercases tag and attribute names)
        if ($token->[0] eq "S" && $token->[1] eq "img") {
            #$token->[2] is a hash of the tag's attributes
            if (!exists $token->[2]{alt}) {
                #if we get here print a message and jump to the next file
                print "$file is missing an alt attribute in an img tag\n";
                last;
            }
        }
            
    }
}

I tested it and it works, it's easy to understand (if you read the HTML::TokeParser docs), and it presented an argument for UTFM over roll-yer-own. I did not get a message back from this monk. My assumption is that he's somewhere else now, asking the same questions about negative lookahead and whatnot.
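
For anyone wondering what the negative lookahead route runs into, here's a rough sketch of the quoted > problem. The sample HTML and the regex are both made up for illustration (they are not what the monk was using), and the parser half is just the same idea as the script above, fed a string instead of a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input: the first img HAS an alt attribute, but a quoted
# '>' shows up before it. The second img really has no alt.
my $html = <<'HTML';
<img title="1 > 0" src="a.png" alt="math">
<img src="b.png">
HTML

# Naive attempt (illustration only): an img tag with no "alt"
# anywhere before the first '>' character.
my @regex_hits = $html =~ /(<img\b(?![^>]*\balt\b)[^>]*>)/gi;

# Parser-based check. Passing a scalar reference makes
# HTML::TokeParser treat the string as the document itself.
my @parser_hits;
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!\n";
while (my $token = $p->get_token) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'img';
    push @parser_hits, $token->[2]{src} unless exists $token->[2]{alt};
}

print "regex flagged:  $_\n" for @regex_hits;
print "parser flagged: $_\n" for @parser_hits;
```

On this input the regex should flag both tags, because it stops at the > inside the title value and never sees the alt after it, while the parser should flag only b.png. You can keep patching the regex to skip quoted strings, but at that point you're writing a parser by hand.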

My point? I just don't understand the fear associated with using a module. The alternative is much scarier to me. Decode CGI variables? Parse HTML? I'm very busy, and I have other work to do. I thank the Perl gods for CPAN.

In reply to UTFM - Use the Friendly Modules by thunders
