use strict and TokeParser

young_stu has asked for the wisdom of the Perl Monks concerning the following question:

Re: use strict and TokeParser
by Joost (Canon) on Nov 23, 2004 at 23:52 UTC

       $p->get_token
           This method will return the next token found in the HTML document, or "undef" at the
           end of the document.  The token is returned as an array reference.  The first element
           of the array will be a string denoting the type of this token: "S" for start tag, "E"
           for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process
           instructions.  The rest of the token array depend on the type like this:

             ["S",  $tag, $attr, $attrseq, $text]
             ["E",  $tag, $text]
             ["T",  $text, $is_data]
             ["C",  $text]
             ["D",  $text]
             ["PI", $token0, $text]

It appears you're trying to read the href attribute from a token that has no attributes (such as an end tag or a text token). Also, not every HTML start tag has an href attribute, so reading it unconditionally should produce warnings if you've enabled them.

How about (untested):

while (my $token = $stream->get_token) {
    if ($token->[0] eq 'S') {                  # start tag
        if (exists $token->[2]{href}) {        # tag has an href attribute
            print "PDF link!\n" if $token->[2]{href} =~ m/\.pdf/;
        }
    }
}
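
To make the token shapes concrete, here is a minimal sketch that does not run the parser at all. The @tokens list is hand-built sample data in the documented shapes, and href_of is a hypothetical helper name, not part of HTML::TokeParser. It only illustrates why the type in element 0 has to be checked before element 2 is treated as an attribute hash.

```perl
use strict;
use warnings;

# Hand-built tokens in the documented shapes (illustrative data, not parser output).
my @tokens = (
    [ 'S', 'a', { href => 'report.pdf' }, ['href'], '<a href="report.pdf">' ],
    [ 'T', 'Annual report', 0 ],
    [ 'E', 'a', '</a>' ],
    [ 'S', 'a', { name => 'top' }, ['name'], '<a name="top">' ],
);

# Hypothetical helper: return the href of a start-tag token, or undef otherwise.
sub href_of {
    my ($token) = @_;
    return unless $token->[0] eq 'S';    # only start tags carry an attribute hash
    my $attr = $token->[2];
    return unless exists $attr->{href};  # e.g. <a name="top"> has no href
    return $attr->{href};
}

for my $token (@tokens) {
    my $href = href_of($token);
    printf "%-2s %s\n", $token->[0], defined $href ? "href=$href" : 'no href';
}
```

For text and end tokens, element 2 is either a flag or a string, which is exactly where the strict 'refs' error comes from.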

updated:

Your code "works" without strict because $something->{href}, where $something is a string, is treated as a symbolic reference: it refers to a global hash named by the value of $something, creating it if it doesn't exist yet (i.e. if $something eq 'blah', a global hash %blah will be created if it doesn't already exist). This can cause all kinds of mayhem and is a good reason to always use strict (see strict 'refs' in the strict documentation and symbolic references in perlref).
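
A small sketch of that behaviour, using only core Perl. The names $name and %blah are made up for the demonstration; the string 'blah' stands in for whatever non-reference value (such as a tag name) ended up where a hash ref was expected.

```perl
use strict;
use warnings;

our %blah;            # declared only so the check below can name the hash
my $name = 'blah';    # a plain string, not a reference

{
    no strict 'refs';              # what running without strict permits
    $name->{href} = 'file.pdf';    # symbolic reference: silently writes to %main::blah
}
print "global %blah now has href=$blah{href}\n";

# With strict 'refs' in force, the same expression is a fatal run-time
# error, which eval traps here so the message can be printed.
my $value = eval { $name->{href} };
print "strict refs says: $@" unless defined $value;
```

The first access quietly succeeds and pollutes the symbol table; the second dies with a message that points at the real mistake.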

updated: moved doc links to cpan.org, perldoc.com is messing up again.

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Re: use strict and TokeParser
by dave_the_m (Monsignor) on Nov 23, 2004 at 23:50 UTC

use HTML::TokeParser

Dave.

Re: use strict and TokeParser
by Ovid (Cardinal) on Nov 24, 2004 at 03:02 UTC

You can make this easier and avoid those types of errors by switching to HTML::TokeParser::Simple and some defensive programming (it's easier to read, too.)

use strict;
use HTML::TokeParser::Simple;

my $filepath = "c:/folder/file.html";

my $stream = HTML::TokeParser::Simple->new($filepath) or die $!;
while (my $token = $stream->get_token) {
    next unless $token->is_start_tag('a');
    my $href = $token->get_attr('href');
    next unless defined $href;          # skip anchors such as <a name="...">
    print "PDF link!\n" if $href =~ m/\.pdf/;
}

Cheers,
Ovid

New address of my CGI Course.

Re^2: use strict and TokeParser

by PodMaster (Abbot) on Nov 24, 2004 at 08:04 UTC

The straw that broke the camel's back.

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re^3: use strict and TokeParser

by Ovid (Cardinal) on Nov 24, 2004 at 12:56 UTC

What are you talking about? That's a bug report for HTML::LinkExtractor, a completely different module. I've not received any bug reports for HTML::TokeParser::Simple; I'd want to know so I could quickly fix it.

Been hittin' the bottle again, eh Pod? :)

Cheers,
Ovid

Re^4: use strict and TokeParser

by PodMaster (Abbot) on Jun 02, 2005 at 12:22 UTC

Re^5: use strict and TokeParser

by Ovid (Cardinal) on Jun 03, 2005 at 15:39 UTC

Re^3: Bug reports instead of FUD, please.

by Ovid (Cardinal) on Nov 24, 2004 at 22:40 UTC

So you send me a private message telling me there are, in fact, problems with my module but you refuse to tell me what the problems are. If you can give me some clear, specific issues with HTML::TokeParser::Simple that need to be resolved, I'd be happy to deal with those issues. The truth is, the only bug report I can ever remember getting about this module was from you and that was a couple of years ago. Even then it was not a bug but a bad design decision on my part and I fixed it rather promptly.

So please tell me what the problems are with this module. If you can't think of any, don't go spreading FUD. If you can think of some, why didn't you tell me what they are before you started trashing my module? I didn't write the thing just to say I have something on the CPAN. I want it to actually be useful.

Update: After digging around, I find a cryptic entry in your changes file for HTML::LinkExtractor. All it says is that you stopped using HTML::TokeParser::Simple back in September of this year. There's no mention in the associated bug report that my module was at fault and had you even done something as simple as forward me the link, I would have happily fixed it. Now I'm going on vacation and (if the code is in fact buggy) I have to leave buggy code on the CPAN until I get back.

Cheers,
Ovid
