Yesterday, a new monk was asking around the chatterbox for a regexp that could match an HTML image tag that doesn't have an alt attribute. That sounds easy to a newbie, but everyone who has ever tried dealing with HTML using regexes knows it's not. The reasons are obvious: one has to deal with the possibility of > and < characters inside quotes, you don't know where a certain attribute is going to appear in a tag, etc. I'm no regex wizard, and even the people on the 'box who are just said to use HTML::*. I was one of those voices.
But every time I came back, the same monk was repeating the same questions. I finally messaged that monk with a link to the following code.
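To make those two failure modes concrete, here is a minimal sketch. The pattern in `$naive` is a hypothetical stand-in made up for illustration, not whatever regex the monk was actually working on, and the sample tags are invented as well. It uses negative lookahead to say "an img tag with no alt before its closing >", which is roughly what a first attempt tends to look like.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naive pattern (an assumption for illustration only):
# an img tag with no word "alt" before its first '>'.
my $naive = qr/<img\b(?![^>]*\balt\b)[^>]*>/i;

# Each sample: [ markup, 1 if it truly lacks an alt attribute, else 0 ].
my @samples = (
    [ '<img src="a.png">',               1 ],  # plain case
    [ '<img src="a.png" alt="A">',       0 ],  # plain case
    [ '<img title="a > b" alt="arrow">', 0 ],  # quoted '>' hides the alt
    [ '<img src="alt.png">',             1 ],  # "alt" inside a value
);

# Collect the samples where the regex verdict disagrees with the truth.
my @wrong;
for my $sample (@samples) {
    my ($html, $lacks_alt) = @$sample;
    my $regex_says = $html =~ $naive ? 1 : 0;
    push @wrong, $html if $regex_says != $lacks_alt;
}

print "misjudged: $_\n" for @wrong;
```

The two plain cases come out right. The tag with a quoted > is reported as missing an alt it actually has, and the tag whose src happens to contain "alt" is passed as fine. Each can be patched, but every patch grows the pattern, and the next odd bit of markup breaks it again.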
#!/usr/bin/perl -w
# Program to find img tags without alt attributes
use strict;
use HTML::TokeParser;

# Build a list of the HTML files in the current directory
my @files = grep { /[.]html?$/i } <*>;

# Parse each file
for my $file (@files) {
    my $p = HTML::TokeParser->new($file);
    unless ($p) {
        warn "Can't open $file: $!\n";
        next;
    }
    # Move through each HTML token in the file
    while (my $token = $p->get_token) {
        # Find img start tags (tag names arrive lowercased)
        next unless $token->[0] eq 'S' && $token->[1] eq 'img';
        # $token->[2] is a hash ref of the tag's attributes
        next if exists $token->[2]{alt};
        # If we get here, print a message and jump to the next file
        print "$file is missing an alt attribute in an img tag\n";
        last;
    }
}
I tested it, it works, it's easy to understand (if you read the HTML::TokeParser docs), and it presented an argument for UTFM over roll-yer-own. I did not get a message back from this monk. My assumption is that he's somewhere else now, asking the same questions about negative lookahead and whatnot.
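For anyone who wants to try the module without a directory of HTML files, the same check can be sketched against an in-memory string. HTML::TokeParser accepts a reference to a scalar as the document, and get_tag can skip straight to the next img start tag. The helper name `imgs_missing_alt` and the sample markup are made up for illustration, and the snippet assumes HTML::TokeParser is installed from CPAN and perl is 5.10 or later (for the `//` operator).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the src of every img tag lacking an alt attribute.
sub imgs_missing_alt {
    my ($html) = @_;
    # A scalar reference is parsed as the literal document text.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";
    my @missing;
    # get_tag('img') skips ahead to the next img start tag.
    while (my $tag = $p->get_tag('img')) {
        my $attr = $tag->[1];    # hash ref of attributes, names lowercased
        next if exists $attr->{alt};
        push @missing, $attr->{src} // '(no src)';
    }
    return @missing;
}

# Invented sample markup, including the cases that trip up a simple regex.
my $html = <<'HTML';
<p><img src="a.png" alt="A"></p>
<p><img title="a > b" src="b.png"></p>
<p><IMG SRC="alt.png"></p>
HTML

print "missing alt: $_\n" for imgs_missing_alt($html);
```

The quoted >, the uppercase tag, and the src containing "alt" need no special handling here, because the parser has already split the tag into a name and an attribute hash before the code ever looks at it.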
My point? I just don't understand the fear associated with using a module. The alternative is much scarier to me. Decode CGI variables? Parse HTML? I'm very busy, and I have other work to do. I thank the Perl gods for CPAN.