Well, one of our servers is seriously down at work, shutting down what I'm working on, so I figured I'd put out a small request for comments.

I've recently released HTML::TokeParser::Simple version 1.3, a Perl extension for parsing HTML documents. For example, here's a simple HTML-to-text converter:

    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        next if ! $token->is_text;
        print $token->return_text;
    }

The latest version has a bit of code cleanup and adds a new "is_tag" method. For example, to print everything in an HTML document that is *not* a valid HTML tag:

    use HTML::TokeParser::Simple;
    use HTML::Tagset;
    my $p = HTML::TokeParser::Simple->new( \$html );
    while ( my $token = $p->get_token ) {
        next if $token->is_tag
            and exists $HTML::Tagset::isKnown{ $token->return_tag };
        print $token->return_text;
    }

Also, the is_end_tag() method no longer cares whether or not you have a leading forward slash. The following two lines are equivalent:

    $token->is_end_tag( '/form' );
    $token->is_end_tag( 'form' );
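
In other words, the tag name gets normalized before the comparison. Here's a minimal sketch of that idea in plain Perl. The normalize_tag_name() helper is made up for illustration and is not the module's actual code:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser::Simple):
# drop one optional leading slash and lowercase the name,
# so that '/form' and 'form' compare as equal.
sub normalize_tag_name {
    my ($name) = @_;
    $name =~ s{^/}{};
    return lc $name;
}

print normalize_tag_name('/form') eq normalize_tag_name('form')
    ? "equivalent\n"
    : "different\n";
# prints "equivalent"
```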

However, I'm also considering pushing support for HTML::Tagset (either optional or mandatory, I don't know which) directly into this module. This would allow you to use all of the following code snippets.

Print all of the text in an HTML document while skipping valid HTML tags:

    while ( my $token = $p->get_token ) {
        # the following would skip <p>, but not <pr>
        next if $token->is_valid_tag;
        print $token->return_text;
    }

And wouldn't this be handy?

    if ( $token->can_link ) {
        # check to see if it's really linking to something
    }

The can_link() method would return true for tags such as "script", "a", "td" ("background=..."), etc.
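
To give a feel for what can_link() could be built on, here's a rough sketch. It assumes a lookup table shaped like %HTML::Tagset::linkElements (tag name => list of attributes that can hold a link). To keep it standalone I've copied only a few entries into a local hash, and the links_in() helper is hypothetical:

```perl
use strict;
use warnings;

# Local stand-in shaped like %HTML::Tagset::linkElements.
# Only a handful of entries are reproduced here for illustration.
my %link_attrs = (
    a      => ['href'],
    img    => [ 'src', 'lowsrc', 'longdesc', 'usemap' ],
    script => [ 'src', 'for' ],
    td     => ['background'],
);

# Hypothetical helper: given a tag name and its attribute hashref,
# return the link-bearing attributes that are actually present.
sub links_in {
    my ( $tag, $attr ) = @_;
    my $candidates = $link_attrs{ lc $tag } or return;
    return grep { defined $attr->{$_} } @$candidates;
}

print join( ', ', links_in( 'td', { background => 'bg.png', width => 100 } ) ), "\n";
# prints "background"
```

A tag that *can* link but has none of those attributes set yields an empty list, which is the "is it really linking to something" check.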

Or what about the following:

    $token->is_head_element;
    $token->is_table_element;
    $token->is_body_element;
    # etc.
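
One way such predicates might be generated rather than written by hand is sketched below. Everything here is an assumption for illustration: My::TagToken is a stand-in class, and the two small sets are sample data, not the full HTML::Tagset hashsets:

```perl
package My::TagToken;    # hypothetical stand-in for the token class
use strict;
use warnings;

# Sample sets only; a real version would presumably draw on HTML::Tagset.
my %sets = (
    head  => { map { $_ => 1 } qw( title base link meta style script ) },
    table => { map { $_ => 1 } qw( table caption thead tbody tfoot tr th td ) },
);

# Install one is_<kind>_element method per set.
for my $kind ( keys %sets ) {
    my $set = $sets{$kind};
    no strict 'refs';
    *{ __PACKAGE__ . "::is_${kind}_element" } = sub {
        my $self = shift;
        return exists $set->{ $self->{tag} } ? 1 : 0;
    };
}

sub new {
    my ( $class, $tag ) = @_;
    return bless { tag => lc $tag }, $class;
}

package main;

print My::TagToken->new('TD')->is_table_element ? "yes\n" : "no\n";
# prints "yes"
```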

If you like or dislike that, let me know. I realize that this might be overkill, but since so many of these methods cover what seem to be common cases of parsing HTML, it seems reasonable to give people easy methods of checking for them. It's kind of like one-stop shopping for many of your HTML parsing needs. On the other hand, it goes far beyond the original intent of the module.

I'm also considering deprecating the "return_foo" methods and dropping the "return_". This change I'm more cautious about.

    $token->return_attr;    # becomes $token->attr
    $token->return_text;    # becomes $token->text

It seems like it would be simpler. While some of those method names also show up in the HTML::Parser documentation, they all appear to be user-defined callbacks, so I don't think there's an issue there.
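
If I went this way, the old names could stick around for a while as thin wrappers that warn and forward to the new ones. A minimal sketch of that, assuming a made-up My::Token class rather than the real token class:

```perl
package My::Token;    # hypothetical stand-in, not the real token class
use strict;
use warnings;
use Carp ();

sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
sub text { return $_[0]{text} }
sub attr { return $_[0]{attr} }

# Keep each old return_foo name as a wrapper that warns and forwards.
for my $name (qw( text attr )) {
    no strict 'refs';
    *{ __PACKAGE__ . "::return_$name" } = sub {
        my $self = shift;
        Carp::carp("return_$name() is deprecated; use $name() instead");
        return $self->$name(@_);
    };
}

package main;

my $token = My::Token->new( text => 'hello' );
print $token->text, "\n";           # new name
print $token->return_text, "\n";    # old name still works, with a warning
```

Existing code keeps running, and the warning on STDERR tells people which name to switch to.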

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: RFC: Wider scope for HTML::TokeParser::Simple
by PodMaster (Abbot) on Jul 10, 2002 at 07:42 UTC
    Why?

    Honestly why?

    Ok, forget that for a second, do you have any clue as to what kind of a user base you've got (or rather your module -- how many people use it)?

    At the very least, I think you should either do what the HTML::Parser folk did, and go with a

    HTML::TokeParser::Simple->new ( api_version => 3 );
    or just go with a namespace change to HTML::TokeParser::Simpler. This really depends on whether you're going to put in the work to keep the old-style API, or just change things around completely (even if it's mostly cosmetic).

    Other than that I only have a question, are you going to go with AUTOLOAD now?

      First, the AUTOLOAD has been gone ever since I released the first version. That's not an issue (of course, if the module works as advertised, I don't think that leaving the AUTOLOAD in would have been that much of an issue, either, but I digress...)

      podmaster wrote: "Why? Honestly why?"

      Ovid responds: "why what?" I raised several issues there. Which one are you asking about? I think you're talking about the interface change, but rather than answer a question you may not be asking, I'll just ask you to clarify your question :)

      Cheers,
      Ovid

        Why are you expanding the scope?

        I can understand the is_tag addition, but why add all that HTML::Tagset stuff? ( it wouldn't be simple no more )

        I know the AUTOLOAD has been gone, but with the numerous tags, would it be insane not to go with it (if you add that is_head, is_body stuff)? Also, who's to say what's a valid tag, i.e., what subset of HTML are you going to support (don't say whatever HTML::Tagset supports ;)?

        Seeing as I'm the only one who's got anything to say, I say knock yourself out, but please keep in mind my previous comments on the interface (API version vs. namespace change).