Well, one of our servers is seriously down at work, shutting down what I'm working on, so I figured I'd put out a small request for comments.
I've recently released HTML::TokeParser::Simple version 1.3, a Perl extension for parsing HTML documents.
Simple HTML to Text converter:
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );

    while ( my $token = $p->get_token ) {
        next if ! $token->is_text;
        print $token->return_text;
    }
The latest version has a bit of code cleanup and adds a new "is_tag" method. For example, to print everything in an HTML document that is *not* a valid HTML tag:
    use HTML::TokeParser::Simple;
    use HTML::Tagset;
    my $p = HTML::TokeParser::Simple->new( \$html );

    while ( my $token = $p->get_token ) {
        next if $token->is_tag
            and exists $HTML::Tagset::isKnown{ $token->return_tag };
        print $token->return_text;
    }
Also, the is_end_tag() method no longer cares whether or not you have a leading forward slash. The following two lines are equivalent:
    $token->is_end_tag( '/form' );
    $token->is_end_tag( 'form' );
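For illustration, here is a minimal sketch of how a slash-insensitive comparison like this could work. The helper names normalize_tag_name() and end_tag_matches() are hypothetical and are not taken from the module's source; the sketch only assumes that tag names are plain strings that may or may not start with "/".

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a tag name and drop one leading slash,
# so '/form', 'form' and '/FORM' all compare equal.
sub normalize_tag_name {
    my ($name) = @_;
    $name = lc $name;
    $name =~ s{^/}{};
    return $name;
}

# Hypothetical helper: true if the token's tag name and the name the
# caller asked for refer to the same element.
sub end_tag_matches {
    my ( $tag_name, $wanted ) = @_;
    return normalize_tag_name($tag_name) eq normalize_tag_name($wanted);
}

for my $wanted ( '/form', 'form', 'table' ) {
    my $result = end_tag_matches( '/form', $wanted ) ? 'matches' : 'does not match';
    print "'/form' $result '$wanted'\n";
}
```

Normalizing both sides keeps the check symmetric, so it does not matter which form the token itself stores.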
However, I'm also considering pushing support for HTML::Tagset (either optional or mandatory, I don't know which) directly into this module. This would allow you to use all of the following code snippets.
Print all of the text in an HTML document while skipping valid HTML tags:
    while ( my $token = $p->get_token ) {
        # the following would skip <p>, but not <pr>
        next if $token->is_valid_tag;
        print $token->return_text;
    }
And wouldn't this be handy?
    if ( $token->can_link ) {
        # check to see if it's really linking to something
    }
The can_link() method would return true for tags such as "script", "a", "td" ("background=..."), etc.
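To make the idea concrete, here is a rough sketch of a table-driven check. It is not a proposed implementation: the %link_attributes hash is a small hand-written subset chosen for illustration (HTML::Tagset provides a fuller table of the same shape in %HTML::Tagset::linkElements), and the functions take a tag name and an attribute hash directly instead of a token object. The names can_link(), link_attributes_for() and links_in() are assumptions for the sake of the example.

```perl
use strict;
use warnings;

# Hand-written subset, for illustration only: tag name => attributes
# that may hold a URL. A real version would likely consult HTML::Tagset.
my %link_attributes = (
    a      => ['href'],
    img    => ['src'],
    script => ['src'],
    td     => ['background'],
);

# Hypothetical: true if this kind of tag is able to link to something.
sub can_link {
    my ($tag) = @_;
    return exists $link_attributes{ lc $tag } ? 1 : 0;
}

# Hypothetical: the link-bearing attribute names for a tag (empty list if none).
sub link_attributes_for {
    my ($tag) = @_;
    my $attrs = $link_attributes{ lc $tag } or return;
    return @$attrs;
}

# Hypothetical: the URLs actually present in a tag's attribute hash,
# i.e. "is it really linking to something?"
sub links_in {
    my ( $tag, $attr ) = @_;
    return map  { $attr->{$_} }
           grep { defined $attr->{$_} && length $attr->{$_} }
           link_attributes_for($tag);
}

my @links = links_in( 'td', { background => 'bg.png', width => '100' } );
print "td can link: ", can_link('td'), "; links found: @links\n";
```

Separating "can this tag link?" from "does this particular tag link?" mirrors the comment in the snippet above: can_link() only says that a check is worth doing.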
Or what about the following:
    $token->is_head_element;
    $token->is_table_element;
    $token->is_body_element;
    # etc.
If you like or dislike that, let me know. I realize that this might be overkill, but since so many of these methods cover what seem to be common cases of parsing HTML, it seems reasonable to give people easy methods of checking for them. It's kind of like one-stop shopping for many of your HTML parsing needs. On the other hand, it goes far beyond the original intent of the module.
I'm also considering deprecating the "return_foo" methods and dropping the "return_". This change I'm more cautious about.
    $token->return_attr;
    # becomes
    $token->attr;

    $token->return_text;
    # becomes
    $token->text;
It seems like it would be simpler. While some of those method names also show up in the HTML::Parser documentation, they all appear to be user-defined callbacks, so I don't think there's an issue there.
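One way such a deprecation could be handled is to keep the old names as thin aliases that warn and then forward to the new ones. The sketch below uses a hypothetical stand-in class, My::Token, with only text() and attr() accessors; it is not the module's code, just an illustration of the aliasing technique using Carp.

```perl
package My::Token;    # hypothetical stand-in, not HTML::TokeParser::Simple
use strict;
use warnings;
use Carp qw(carp);

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

# The new, shorter accessor names.
sub text { return $_[0]{text} }
sub attr { return $_[0]{attr} }

# Install return_text() and return_attr() as deprecated aliases that
# warn from the caller's point of view and then call the new method.
for my $name (qw( text attr )) {
    my $alias = sub {
        my $self = shift;
        carp "return_$name() is deprecated; use $name() instead";
        return $self->$name(@_);
    };
    no strict 'refs';
    *{ __PACKAGE__ . "::return_$name" } = $alias;
}

package main;

my $token = My::Token->new( text => 'Hello', attr => { href => '/index.html' } );
print $token->text, "\n";           # new name, no warning
print $token->return_text, "\n";    # old name still works, but warns on STDERR
```

Existing code keeps running under this approach, and the warning gives users a release or two to switch before the old names are removed.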
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
Re: RFC: Wider scope for HTML::TokeParser::Simple
by PodMaster (Abbot) on Jul 10, 2002 at 07:42 UTC
by Ovid (Cardinal) on Jul 10, 2002 at 21:00 UTC
by PodMaster (Abbot) on Jul 12, 2002 at 07:23 UTC