UTFM - Use the Friendly Modules

Yesterday, a new monk was asking around the chatterbox for a regexp that could match an HTML image tag that doesn't have an alt attribute. Sounds easy to a newbie, but everyone who's ever tried dealing with HTML using regexes knows it's not. The reasons are obvious: one has to deal with the possibility of > and < characters inside quotes, you don't know where a certain attribute is going to appear in a tag, etc. I'm no regex wizard, and even the people on the 'box who are just said use HTML::*. I was one of these voices. But every time I came back, the same monk was repeating the same questions. I finally messaged that monk with a link to the following code.

#!/usr/bin/perl -w
# program to find img tags w/o alt attributes

use strict;
use HTML::TokeParser;

# build list of HTML files (.htm or .html) in the current directory
my @files = grep { /\.html?$/i } <*>;

# parse each file
for my $file (@files) {
    my $p = HTML::TokeParser->new($file)
        or do { warn "Can't open $file: $!\n"; next; };

    # move through each HTML token in the file
    while (my $token = $p->get_token) {
        # find IMG start tags (the parser lowercases tag and attribute names)
        next unless $token->[0] eq "S" && $token->[1] eq "img";

        # $token->[2] is a hash ref of the tag's attributes
        next if exists $token->[2]{alt};

        # if we get here, print a message and jump to the next file
        print "$file is missing an alt attribute in an img tag\n";
        last;
    }
}

I tested it and it works, it's easy to understand (if you read the HTML::TokeParser docs), and it presents an argument for UTFM over roll-yer-own. I did not get a message back from this monk. My assumption is that he's somewhere else now, asking the same questions about negative lookahead and whatnot.

My point? I just don't understand the fear associated with using a module. The alternative is much scarier to me. Decode CGI variables? Parse HTML? I'm very busy, and I have other work to do. I thank the Perl gods for CPAN.
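
For anyone wondering what actually goes wrong with the regex route, here is a minimal sketch using only core Perl. The sub name naive_missing_alt and the sample tags are made up for illustration; this is not the regex the monk was given, just the kind of first attempt a newbie might write. It shows the two problems mentioned at the top: a > inside a quoted value, and not knowing where (or whether) "alt=" really appears as an attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive approach (illustrative only): grab everything from "<img" up to
# the first ">" and report the tag if that text does not contain "alt=".
sub naive_missing_alt {
    my ($html) = @_;
    my @missing;
    while ($html =~ /(<img\b[^>]*>)/gi) {
        my $tag = $1;
        push @missing, $tag unless $tag =~ /\balt\s*=/i;
    }
    return @missing;
}

my @samples = (
    # [ description, html, number of img tags that truly lack alt ]
    [ 'no alt at all',               '<img src="a.png">',                          1 ],
    [ 'alt present',                 '<img src="a.png" alt="logo">',               0 ],
    [ '> inside a quoted value',     '<img title="a > b" src="a.png" alt="logo">', 0 ],
    [ '"alt=" inside another value', '<img src="a.png" title="no alt=here">',      1 ],
);

for my $sample (@samples) {
    my ($desc, $html, $expected) = @$sample;
    my @found = naive_missing_alt($html);
    my $got   = scalar @found;
    printf "%-30s expected %d, regex reported %d%s\n",
        $desc, $expected, $got, $got == $expected ? '' : '  <-- wrong';
}
```

The last two samples are the ones the regex gets wrong: the third is cut off at the > inside the title value, and the fourth is fooled by the text "alt=" sitting inside another attribute. You can keep patching the pattern, but every patch is a step toward rewriting the parser that HTML::TokeParser already gives you.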


Replies are listed 'Best First'.

Re: UTFM - Use the Friendly Modules
by gav^ (Curate) on Mar 08, 2002 at 23:22 UTC

use HTML::TreeBuilder;
foreach (<*.html>) {
    my $tree = HTML::TreeBuilder->new_from_file($_);
    # use defined() so that an explicit alt="" is not reported as missing
    if ($tree->look_down('_tag', 'img', sub { !defined $_[0]->attr('alt') })) {
        print "$_ has a missing alt tag\n";
    }
    $tree->delete;
}

gav^


Re: UTFM - Use the Friendly Modules
by mrbbking (Hermit) on Mar 09, 2002 at 02:49 UTC

1. CPAN is enormous, with much duplication. How do I know which module to choose from a group of similarly-named modules?
2. Sometimes I like the challenge of solving the problem myself.

That's it. Maybe with more experience, I'd start to recognize more author names on CPAN and associate them with quality modules - to help me avoid spending too much time on the bad/old/poorly maintained ones. That might help me avoid #1 more frequently.
I do tend to avoid modules that have not been updated recently, especially if they have low version numbers.

And - for what I do with Perl - I often have the luxury of playing with the problem on my own for a while. I like this.

Now, I *do* go to CPAN immediately when I know of a specific module that I think will be helpful. I've done this with Text::Template, CGI (of course), LibXML and others. To figure out what module to try, I come here and play with Super Search. But when I have a problem and don't know where to start to find a module, I'd rather try to figure it out myself first.

I'm not "afraid" of using a module. Just sometimes don't know where to start, and other times feel up to a challenge.


Re: Re: UTFM - Use the Friendly Modules

by thunders (Priest) on Mar 09, 2002 at 04:40 UTC

I too can think of a few reasons that one would not want to, or would not be able to, use CPAN or its relatives. First off, Windows users don't always have the ability to build CPAN's Unix-centric modules. Also, some users indicate that they are deploying an app or CGI on a shared server, where they do not have root. So in my recommendations I try to list either modules that currently ship with Perl, or modules that are trivial to install with make, nmake, cpan, or ppm and that have limited dependencies.

HTML::TokeParser fits within these constraints, as it came with both my Linux and ActiveState Perl builds.
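
A quick way to see whether a module is already on a given box before recommending it is to try loading it. This is only a sketch: module_available is a made-up helper name, and the default list is simply the three modules mentioned in this thread. Pass other module names on the command line to check those instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from the current @INC, else 0.
sub module_available {
    my ($module) = @_;
    # Only accept strings that look like module names before eval-ing them.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    return eval("require $module; 1") ? 1 : 0;
}

my @modules = @ARGV
    ? @ARGV
    : qw(HTML::TokeParser HTML::TreeBuilder HTML::LinkExtor);

for my $module (@modules) {
    printf "%-20s %s\n", $module,
        module_available($module) ? 'installed' : 'not found';
}
```

What it prints depends entirely on the Perl installation it runs under, which is rather the point.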


Re: UTFM - Use the Friendly Modules
by cjf (Parson) on Mar 09, 2002 at 07:52 UTC

I just don't understand the fear associated with using a module. The alternative is much scarier to me. Decode CGI variables? Parse HTML?

Scarier to you because you know better by now. Try looking at this from a complete Perl novice's position. They see basically two options:

1. Use a module. First they have to understand exactly what a module is, then which one to use, then where to get it, then how to download it, then how to install it, then how to use it in their script.
2. Ask for a regex or, at most, a few short lines of code (their perspective, not mine), insert it in their existing script, done.

For most tasks (especially param or HTML parsing) using a module is by far the better choice. Convincing someone of this is the hard part. To the uninitiated, your image tag example seems pretty simple, why would they need a module for it?

Obviously you can't spend an hour discussing the various flaws in a regex with every monk that comes along. If they choose to ignore your advice in the first place that's up to them. They'll either eventually get tired of fixing buggy code and use modules, or they'll continue to write buggy code and won't be any competition in the job market ;-).


Re: Re: UTFM - Use the Friendly Modules

by fuzzyping (Chaplain) on Mar 09, 2002 at 12:37 UTC

How to install? Although documentation on installing modules can be found, it's not always written for the newbie audience. I know that a lot of monks swear by the CPAN module. Yes, it's great when it works... but a frightening experience to a newbie when it doesn't. First, just installing the CPAN module can be an exercise in futility... the looping behavior with excessive output is just enough to make those little hairs on the back of your neck stand on end.

When you finally get CPAN installed, there's no guarantee it's always going to work. Try installing Net::SSH::Perl using CPAN. Good Luck. Dollars-to-doughnuts it will die on IDEA.pm installation. Just an example, YMMV.

Personally, I prefer to install my modules by hand. It's almost as easy (perl Makefile.PL && make && make test && make install) and really gives you a better idea of what's going on and where. I like to know what's being installed in my system, if it's conflicting with existing items, and if it has any dependencies (wash, rinse, repeat). Some of this need for control likely stems from my parallel concern with systems security.

Using most modules does not require that you're comfortable with OOP, but it sure suggests it, if you want to get the most out of the module. Yes, that's the whole design and purpose behind CPAN... reusable code. Nevertheless, it's a daunting task to learn OOP just so you can use one specific module.


Re: UTFM - Use the Friendly Modules
by bluto (Curate) on Mar 09, 2002 at 00:42 UTC

As far as this monk goes, who knows? I haven't been following the discussion you've mentioned, but FWIW, I've seen this kind of behaviour a lot and my general impression is that these folks barely have their perl "sea legs" and are unprepared to figure out how to install new modules and maintain them. It is sad though when they don't even make an effort to just give it a go. It is sadder still, and rather insulting, when they don't give a reason as to why a module won't work for them. Even a lame excuse (e.g. "My boss is paranoid about code that we didn't write.") is better than silence, since you can help correct the notion that "not written here" == bad. Newbies that have the "Right Stuff" will work it through and say "thanks", rather than regurgitate the same question.

bluto


Re: Re: UTFM - Use the Friendly Modules

by rjray (Chaplain) on Mar 09, 2002 at 02:34 UTC

It's difficult to say why people would be averse to using modules. It's very easy to point people to a given module, since the formatter for nodes recognizes [cpan://] and handles it so well. The person looking for HTML-parsing help was asking the right questions, and the people on the box were giving him the right answers. Even in SoPW messages, I see people point new users at the most useful CPAN modules, with no indication that the recommendation stuck. Fortunately, there are plenty of users who follow up on their own question to let us know that they listened to, and benefitted from, our aid :-).

--rjray


(crazyinsomniac) Re: UTFM - Use the Friendly Modules
by crazyinsomniac (Prior) on Mar 09, 2002 at 03:36 UTC

HTML::LinkExtor

So like you said, I too say use modules ;)

#!/usr/bin/perl -w
use strict;
use Data::Dumper qw/DumperX/;
use HTML::LinkExtor;
use vars qw/$file/;

die "pass args man" unless @ARGV;

my $p = HTML::LinkExtor->new(\&foy);

for $file (@ARGV) {
    $p->parse_file($file);
}

sub foy {
   my($t,%A) = @_;

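   # NB: per the HTML::LinkExtor docs, the callback receives only link
   # attributes (such as src), so don't expect alt to show up in %A
   if($t eq 'img'){
        warn DumperX(\%A).$file unless($A{alt});
   }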
}

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Re: UTFM - Use the Friendly Modules
by demerphq (Chancellor) on Mar 11, 2002 at 13:25 UTC

The key point here was that he was unaware of the difficulties involved (despite certain venerable monks' examples of the problems involved) and so incorrectly tried to use a simple approach that ultimately was doomed to failure.

Yves / DeMerphq
--
When to use Prototypes?
Advanced Sorting - GRT - Guttman Rosler Transform


