Well, one of our servers is seriously down at work, shutting down what I'm working on, so I figured I'd put out a small request for comments.

I've recently released HTML::TokeParser::Simple version 1.3, a Perl extension for parsing HTML documents. Here's a simple HTML-to-text converter:

    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );

    while ( my $token = $p->get_token ) {
        next if ! $token->is_text;
        print $token->return_text;
    }

The latest version has a bit of code cleanup and adds a new "is_tag" method. For example, to print everything in an HTML document that is *not* a valid HTML tag:

    use HTML::TokeParser::Simple;
    use HTML::Tagset;
    my $p = HTML::TokeParser::Simple->new( \$html );

    while ( my $token = $p->get_token ) {
        next if $token->is_tag
            and exists $HTML::Tagset::isKnown{ $token->return_tag };
        print $token->return_text;
    }

Also, the is_end_tag() method no longer cares whether or not you have a leading forward slash. The following two lines are equivalent:

    $token->is_end_tag( '/form' );
    $token->is_end_tag( 'form' );
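A minimal sketch of how that kind of slash-insensitive comparison might work. The helper name `end_tag_matches` is hypothetical and this is not the module's actual code; it only illustrates stripping one optional leading slash before comparing tag names.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser::Simple): compare two tag
# names, ignoring case and one optional leading forward slash on either side.
sub end_tag_matches {
    my ( $token_tag, $wanted ) = @_;
    s{^/}{} for $token_tag, $wanted;    # strips the slash from the local copies
    return lc($token_tag) eq lc($wanted) ? 1 : 0;
}

for my $wanted ( '/form', 'form', 'table' ) {
    printf "%-6s %s\n", $wanted,
        end_tag_matches( 'form', $wanted ) ? 'matches' : 'does not match';
}
```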

However, I'm also considering pushing support for HTML::Tagset (either optional or mandatory, I don't know which) directly into this module. This would allow you to use all of the following code snippets.

For example, a program could print all of the text in an HTML document while skipping valid HTML tags:

    while ( my $token = $p->get_token ) {
        # the following would skip <p>, but not <pr>
        next if $token->is_valid_tag;
        print $token->return_text;
    }

And wouldn't this be handy?

    if ( $token->can_link ) {
        # check to see if it's really linking to something
    }

The can_link() method would return true for tags such as "script", "a", "td" ("background=..."), etc.
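As a rough sketch, a check like this could be built on `%HTML::Tagset::linkElements`, which maps tag names to the attributes that may hold a link. The standalone functions `can_link` and `link_attributes` below are hypothetical; they take a tag name rather than a token, and the proposed method might well look different.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hypothetical helper: true if HTML::Tagset lists any link-bearing attribute
# for the named tag.
sub can_link {
    my ($tag) = @_;
    return exists $HTML::Tagset::linkElements{ lc $tag } ? 1 : 0;
}

# Hypothetical helper: the attribute names that may hold a link for this tag
# (an empty list if the tag has none).
sub link_attributes {
    my ($tag) = @_;
    return @{ $HTML::Tagset::linkElements{ lc $tag } || [] };
}

for my $tag (qw( a script td p )) {
    printf "%-6s %s\n", $tag,
        can_link($tag)
        ? join( ', ', link_attributes($tag) )
        : '(no link attributes)';
}
```

Returning the attribute list as well would let the caller follow up on the "is it really linking to something" question by checking only the relevant attributes.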

Or what about the following:

    $token->is_head_element;
    $token->is_table_element;
    $token->is_body_element;
    # etc.
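These predicates could plausibly be thin lookups into the hash sets that HTML::Tagset already exports. The sketch below assumes that approach and uses hypothetical standalone functions that take a tag name instead of a token.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Hypothetical stand-ins for the proposed token methods. Each one is a plain
# membership test against the corresponding HTML::Tagset hash set.
sub is_head_element  { return $HTML::Tagset::isHeadElement{ lc $_[0] }  ? 1 : 0 }
sub is_table_element { return $HTML::Tagset::isTableElement{ lc $_[0] } ? 1 : 0 }
sub is_body_element  { return $HTML::Tagset::isBodyElement{ lc $_[0] }  ? 1 : 0 }

for my $tag (qw( title td p )) {
    printf "%-6s head=%d table=%d body=%d\n", $tag,
        is_head_element($tag), is_table_element($tag), is_body_element($tag);
}
```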

If you like or dislike that, let me know. I realize that this might be overkill, but since so many of these methods cover what seem to be common cases of parsing HTML, it seems reasonable to give people easy methods of checking for them. It's kind of like one-stop shopping for many of your HTML parsing needs. On the other hand, it goes far beyond the original intent of the module.

I'm also considering deprecating the "return_foo" methods and dropping the "return_". This change I'm more cautious about.

    $token->return_attr;    # becomes $token->attr;
    $token->return_text;    # becomes $token->text;

It seems like it would be simpler. While some of those method names also show up in the HTML::Parser documentation, they all appear to be user-defined callbacks, so I don't think there's an issue there.
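One way such a rename could be phased in, sketched under assumptions: keep the old names as thin wrappers that warn and then delegate to the new ones. The `My::Token` package and its constructor are hypothetical stand-ins, not the module's real internals.

```perl
use strict;
use warnings;

package My::Token;    # hypothetical stand-in for the real token class
use Carp qw( carp );

sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
sub text { return $_[0]{text} }
sub attr { return $_[0]{attr} }

# Install return_text and return_attr as deprecated aliases for text and attr.
for my $name (qw( text attr )) {
    no strict 'refs';
    *{"My::Token::return_$name"} = sub {
        carp "return_$name is deprecated; use $name instead";
        return shift->$name(@_);
    };
}

package main;

my $token = My::Token->new( text => 'hello' );
print $token->text, "\n";           # new name
print $token->return_text, "\n";    # old name still works, but warns
```

Existing code keeps running during the transition, and the warning tells users which calls to update.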

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

