RFC: HTML::TokeParser::Easy

Important update: I forgot one of the most important points about this module. Like CGI::Safe, it's a drop-in replacement. You can use this in place of HTML::TokeParser and it's completely transparent. Then, just use the new methods wherever you want the greater clarity. This simplifies migration to this module.

Update 2: crazyinsomniac is right: the AUTOLOAD has got to go. When the methods were simple, the AUTOLOAD sort of made sense. Now, however, I need to overload a couple of them, and even pulling those out and refactoring the rest into an AUTOLOAD just doesn't have enough benefit to justify the obfuscation. Darn it. If anyone else had written this, I would have been the first to point that out. How reluctant we are to admit that our children are ugly :)

After prompting from a couple of monks, I finally got off my duff and finished up the HTML::TokeParser::Easy module. This is basically an adaptor for HTML::TokeParser that makes the module easier to use (no more memorizing array indices). For example, with HTML::TokeParser, if you want to find out if a token returned from get_token() is a start token and a form token, you would do this:

    if ( $token->[0] eq 'S' and $token->[1] eq 'form' ){...}

Now, you just do this:

    if ( $token->is_start_tag( 'form' ) ){...}

Is a token a comment?

    if ( $token->is_comment ){...}

That was originally $token->[0] eq 'C'.

Need the attributes of a given token?

    my $attributes = $token->attr;

That code was $token->[2], or $token->[1], depending upon whether you generated the token with get_token() or get_tag(). Now, it's one standard method.
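To make the idea concrete, here is a minimal sketch of how accessor methods can sit on top of a get_token()-style array. The package name Sketch::Token and its method bodies are assumptions made up for illustration; this is not the module's actual code, only the general shape of "bless the array and name the indices".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the token class; NOT the module's real code.
{
    package Sketch::Token;

    # Wrap a get_token()-style start tag: [ 'S', tag, attr, attrseq, text ]
    sub new {
        my ( $class, $token ) = @_;
        return bless $token, $class;
    }

    # True for any start tag, or only for the named tag if one is given
    sub is_start_tag {
        my ( $self, $tag ) = @_;
        return 0 unless $self->[0] eq 'S';
        return 1 unless defined $tag;
        return lc( $self->[1] ) eq lc($tag) ? 1 : 0;
    }

    sub is_comment { return $_[0]->[0] eq 'C' ? 1 : 0 }

    # Attribute hash ref for start tags, an empty hash ref otherwise
    sub attr {
        my $self = shift;
        return $self->[0] eq 'S' ? $self->[2] : {};
    }
}

my $token = Sketch::Token->new(
    [ 'S', 'form', { action => '/search' }, ['action'], '<form action="/search">' ]
);

print "start of a form\n" if $token->is_start_tag('form');
print "action: ", $token->attr->{action}, "\n";
```

Because the object is still the same array underneath, code that indexes into it directly keeps working, which is what makes a drop-in replacement possible.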

If this interests you and you use HTML::TokeParser, please download and test the distribution. I haven't written the tests yet, but I won't upload to the CPAN without them. I at least managed to get POD written up :)

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.


(crazyinsomniac : a request) Re: RFC: HTML::TokeParser::Easy
by crazyinsomniac (Prior) on Mar 02, 2002 at 01:36 UTC

a request

I'd also like a get_hashtoken method, which would return a hashref token that'd look something like:

$VAR1 = { 'type' => 'S', # or T|E|C|D|PI
          'tag' => 'html',
          'attr' => {},
          'attrseq' => [],
          'text' => '<HTML>',
        };

my $token = $parser->get_hashtoken();
if($token->{type} eq 'S') {
    if(exists $token->{attr}->{href}) {
        print Data::Dumper::DumperX($token);
    }
}
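A rough sketch of what such a conversion could look like, assuming the array layouts that get_token() returns for each token type. The function name hashtoken_from_array and the key names is_data and token0 are invented here for illustration; nothing below is part of HTML::TokeParser or of the proposed module.

```perl
use strict;
use warnings;

# Hypothetical mapping from token type to the names of its array slots.
# The key names (especially is_data and token0) are assumptions.
my %LAYOUT = (
    S  => [qw(type tag attr attrseq text)],
    E  => [qw(type tag text)],
    T  => [qw(type text is_data)],
    C  => [qw(type text)],
    D  => [qw(type text)],
    PI => [qw(type token0 text)],
);

# Turn a get_token()-style array ref into a hash ref with named keys
sub hashtoken_from_array {
    my ($token) = @_;
    my $keys = $LAYOUT{ $token->[0] }
        or die "Unknown token type: $token->[0]";
    my %hash;
    @hash{@$keys} = @$token;    # hash slice: pair names with values
    return \%hash;
}

my $hashtoken = hashtoken_from_array(
    [ 'S', 'a', { href => '/index.html' }, ['href'], '<a href="/index.html">' ]
);

print "link to $hashtoken->{attr}{href}\n"
    if $hashtoken->{type} eq 'S' and exists $hashtoken->{attr}{href};
```

A table of layouts keeps the conversion to one hash slice, and a new token type would only need one more row.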

Also, please get rid of the AUTOLOAD magic, you're not Lincoln Stein (it is unnecessary, you don't have that many methods like CGI.pm, and it slows things down, please .. if your arms are cramping up, I'll volunteer to do the typing).

update: on the AUTOLOAD issue, I too want to ask "Anyone else think that the AUTOLOAD should go?"

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


(Ovid) Re: (2): RFC: HTML::TokeParser::Easy

by Ovid (Cardinal) on Mar 02, 2002 at 03:25 UTC

Hey, the hash functionality looks interesting. I'll keep that in mind. I'll look at how it's organized now to see how I could implement that.

As for the AUTOLOAD, I'm typically a B&D programmer. The only reason I chose that route is to make maintenance easier. Once it is structured correctly, if more stuff is added to HTML::TokeParser, I just add it to the hash and the methods are auto-generated. If people think that this is too serious an objection, I can take it out.

As for AUTOLOAD slowing things down, I don't think it does (though I haven't benchmarked it). In my experience, people really only use a few of the array elements from HTML::TokeParser. In this module, those elements translate to methods, and once generated, they are added to the symbol table and the overhead is gone. Even if you called every method generated, you would probably add less than a second to the total runtime of the program (and if you called every method, then you have a huge program and that second is meaningless).
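For readers who haven't seen the technique, the general pattern looks roughly like this: AUTOLOAD builds the method from a lookup table on the first call, installs it in the symbol table, and is never consulted for that name again. The package Sketch::AutoToken and the %type_for table are hypothetical stand-ins, not the code from the distribution.

```perl
use strict;
use warnings;

{
    package Sketch::AutoToken;
    our $AUTOLOAD;

    # Hypothetical table: method name => token type it tests for
    my %type_for = (
        is_text        => 'T',
        is_comment     => 'C',
        is_declaration => 'D',
    );

    sub new {
        my ( $class, $token ) = @_;
        return bless $token, $class;
    }

    sub AUTOLOAD {
        ( my $name = $AUTOLOAD ) =~ s/.*:://;
        return if $name eq 'DESTROY';    # never generate a destructor
        my $type = $type_for{$name}
            or die "No such method: $name";

        # Install the generated method so later calls skip AUTOLOAD
        no strict 'refs';
        *{$AUTOLOAD} = sub { $_[0]->[0] eq $type ? 1 : 0 };
        goto &$AUTOLOAD;                 # re-dispatch with the original @_
    }
}

my $comment = Sketch::AutoToken->new( [ 'C', '<!-- hi -->' ] );

print "it's a comment\n" if $comment->is_comment;    # goes through AUTOLOAD once
print "is_comment is now a real method\n"
    if Sketch::AutoToken->can('is_comment');
```

The cost is paid once per method name, which is the point being made about overhead; the readability cost of the indirection is a separate question.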

Anyone else think that the AUTOLOAD should go?

Cheers,
Ovid

P.S.: Thanks for the feedback :)

Update: Now that I think about it, the AUTOLOAD can be simplified by having the two overloaded (is_(start|end)_tag) functions moved into their own methods. Everything else is identical, though. Hmm... maybe it should go.


Re: RFC: HTML::TokeParser::Easy
by theguvnor (Chaplain) on Mar 02, 2002 at 06:39 UTC

get_tag()

..Guv

(Ovid) Re(2): RFC: HTML::TokeParser::Easy

by Ovid (Cardinal) on Mar 02, 2002 at 07:09 UTC

It does get around that problem, but it was tricky to pull off. The get_tag() method returned a token in the same format as a start or end tag token from the get_token() method, minus the first array element. I wound up "faking" the method call by unshifting on the appropriate array element, if the token was generated by get_tag(), and then shifting it back off at the end of the method call. It was ugly, but it solved problems that other strategies created (such as not breaking inheritance).
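The unshift-then-shift trick can be sketched like this. How the real code decides that a token came from get_tag() isn't shown in this thread, so the check below (the first element is not one of the uppercase type codes) is an assumption, as is the function name tag_name.

```perl
use strict;
use warnings;

# Hypothetical: return the tag name for arrays from get_token() or get_tag()
sub tag_name {
    my ($token) = @_;
    my $padded = 0;

    # Assumption: get_token() arrays start with an uppercase type code,
    # while get_tag() arrays start with a lowercased tag name or "/name"
    unless ( $token->[0] =~ /^(?:S|E|T|C|D|PI)$/ ) {
        my $type = $token->[0] =~ m{^/} ? 'E' : 'S';
        unshift @$token, $type;    # temporarily fake the missing element
        $padded = 1;
    }

    my $name = $token->[1];        # same index for both kinds of token now
    $name =~ s{^/}{};              # get_tag() reports end tags as "/name"

    shift @$token if $padded;      # put the caller's array back as it was
    return $name;
}

my $from_get_token = [ 'S', 'a', { href => '/' }, ['href'], '<a href="/">' ];
my $from_get_tag   = [ 'a', { href => '/' }, ['href'], '<a href="/">' ];

print tag_name($from_get_token), "\n";
print tag_name($from_get_tag),   "\n";
```

The body of the method only ever deals with one layout, and the caller's array is unchanged afterwards, so code that still indexes into get_tag() results directly is not affected.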

Cheers,
Ovid
