I'm working on HTML::TokeParser::Simple, version 3.0. Changes:

  1. Deprecate the return_* methods. $token->return_attr($foo); becomes $token->get_attr($foo) or $token->attr($foo).
  2. Have an optional (or automatic?) HTML entity encoding/decoding.
  3. Internals cleanup.
  4. Possibly allow auto-fetching from urls.

Item 1 is because return_foo is a horrible method name and I, the author, keep forgetting it. Still, this module is popular enough that I worry quite a bit about changing the API, even though I don't plan on removing deprecated methods for a long time.
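A minimal sketch of how such a deprecation could look, assuming the old name is kept as a thin wrapper around the new one. My::Token is a hypothetical stand-in for the token class, not the module's real internals, and the warning is only meant to fire when the caller has the 'deprecated' warning category enabled:

```perl
package My::Token;    # hypothetical stand-in for the token class
use strict;
use warnings;

sub new {
    my ($class, %attr) = @_;
    return bless { attr => {%attr} }, $class;
}

# New, preferred name.
sub get_attr {
    my ($self, $name) = @_;
    return $self->{attr}{$name};
}

# Old name kept as a wrapper so existing scripts keep working.
sub return_attr {
    my $self = shift;
    warnings::warnif('deprecated', 'return_attr() is deprecated; use get_attr()');
    return $self->get_attr(@_);
}

package main;

my $token = My::Token->new(href => 'http://www.perlmonks.org/');
print $token->get_attr('href'), "\n";
```

Both names return the same value, so old code does not break while the docs move to the new name.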

However, the one thing that keeps bugging me is my desire to solve vegasjoe's problem. This is probably the most common question that I field. If you have an HTML document in a file, it's easy to parse:

my $parser = HTML::TokeParser::Simple->new($file);

Of course, that's because I just inherit from HTML::TokeParser and don't worry about what's behind the scenes. However, what I really want to do is make this work:

my $parser = HTML::TokeParser::Simple->new($html_in_a_string);

Currently, people seem to get really confused because it's not intuitive to take a reference to a scalar to parse in-memory HTML. I'll probably do something like:

sub new_from_scalar {
    my ($class, $scalar) = @_;
    return $class->new(\$scalar);
}

Ultimately, I think we'll have the following constructors:

my $parser1 = HTML::TokeParser::Simple->new($file);          # works
my $parser2 = HTML::TokeParser::Simple->new($file_handle);
my $parser3 = HTML::TokeParser::Simple->new_from_scalar($string);
my $parser4 = HTML::TokeParser::Simple->new_from_fqdn($fqdn);

Feedback and advice welcome.

Cheers,
Ovid

New address of my CGI Course.

Replies are listed 'Best First'.
Re: HTML::TokeParser::Simple advice requested
by borisz (Canon) on Aug 13, 2004 at 19:00 UTC
    I find these two just too ugly.
    my $parser3 = HTML::TokeParser::Simple->new_from_scalar($string);
    is just superfluous, since we already have my $p = HTML::TokeParser::Simple->new(\$string);. If you want novice users to do the right thing, just give an example in the docs. A Perl programmer has to learn about references anyway.

    BTW: If you remove the reference from the new method because it is too hard for new users, you should consider removing references from HTML::TokeParser::Simple altogether. I look at $p->get_token and think that is not an option, but who knows. Nearly the same goes for my $p = HTML::TokeParser::Simple->new_from_fqdn($fqdn); an example in the docs is enough.
    use LWP::Simple;
    my $content = get 'http://www.perlmonks.org/';
    die "Can't get content" unless defined $content;
    HTML::TokeParser::Simple->new(\$content);
    Or, add a hashref with lots of new options instead of more new like methods.
    HTML::TokeParser::Simple->new({ url    => 'http://www.perlmonks.org/' });
    HTML::TokeParser::Simple->new({ scalar => $content });
    Boris
      Or, add a hashref with lots of new options instead of more new like methods.

      ++, but there's no need for a hash reference:

      my $parser1 = HTML::TokeParser::Simple->new(path   => $file_name);
      my $parser2 = HTML::TokeParser::Simple->new(handle => $file_handle);
      my $parser3 = HTML::TokeParser::Simple->new(string => $string);
      my $parser4 = HTML::TokeParser::Simple->new(fqdn   => $fqdn);

      And is there some reason a bit more auto-sensing couldn't be added to make these be typically implicit? Sure, to be safe, you'd want to use the two-argument form above, but for one-offs you could use the short form.

      my $parser1 = HTML::TokeParser::Simple->new($file_name);
      my $parser2 = HTML::TokeParser::Simple->new($file_handle);
      my $parser3 = HTML::TokeParser::Simple->new($long_string);
      my $parser4 = HTML::TokeParser::Simple->new($uri);

      sub new {
          my $class = shift;
          # One argument: guess the mode. Two arguments: mode => target.
          my ($mode, $target) = (@_ == 1 ? $class->guess_mode($_[0]) : (), @_);
          my $source = ( $mode eq 'path' )      ? $target
                     : ( $mode eq 'stringref' ) ? $target
                     : ( $mode eq 'string' )    ? \$target
                     : do { my $method = "source_for_$mode";
                            $class->$method( $target ) };
          $class->SUPER::new( $source );
      }

      sub guess_mode {
          my $class = shift;
          ( ref($_[0]) =~ /^IO|FileHandle/ ) ? 'handle'
        : ( ref($_[0]) eq 'SCALAR' )         ? 'stringref'
        : ( $_[0] =~ /^\w{3,6}\:/ )          ? 'uri'     # scheme of 3 to 6 word characters
        : ( length($_[0]) > 1024 )           ? 'string'
        :                                      'path';
      }

      sub source_for_uri {
          my ($class, $uri) = @_;
          # ...
      }
        Or, add a hashref with lots of new options instead of more new like methods. ++, but there's no need for a hash reference:
        my $parser1 = HTML::TokeParser::Simple->new(path   => $file_name);
        my $parser2 = HTML::TokeParser::Simple->new(handle => $file_handle);
        my $parser3 = HTML::TokeParser::Simple->new(string => $string);
        my $parser4 = HTML::TokeParser::Simple->new(fqdn   => $fqdn);
        I like it more. It is extensible, reusable and faster. Sure it is possible and a good solution too.
        sub guess_mode {
            my $class = shift;
            ( ref($_[0]) =~ /^IO|FileHandle/ ) ? 'handle'
          : ( ref($_[0]) eq 'SCALAR' )         ? 'stringref'
          : ( $_[0] =~ /^\w{3,6}\:/ )          ? 'uri'
          : ( length($_[0]) > 1024 )           ? 'string'
          :                                      'path';
        }
        Please, no guess mode; this makes a module unusable. I like pathnames > 1024, even if I do not type them, since I may use a module to scan my hard disk and end up with a longer path. Also, file names on my disk contain ':' and so on.
        Boris

        For your &guess_mode to be safe, use objects instead,

        my $parser1 = HTML::TokeParser::Simple->new(IO::File::->new($file_name));
        my $parser2 = HTML::TokeParser::Simple->new($file_handle);
        my $parser3 = HTML::TokeParser::Simple->new($long_string);
        my $parser4 = HTML::TokeParser::Simple->new(URI::->new($uri));
        but now you can just as well use the named parameters style instead.

        ihb

        Read argumentation in its context!

Re: HTML::TokeParser::Simple advice requested
by Zaxo (Archbishop) on Aug 13, 2004 at 22:37 UTC

    I like that you've followed the PerlIO convention of taking a reference to a scalar as the marker for an in-memory file. That is an admirably concise notation for something that was previously difficult to do.

    You say it's not intuitive, but nothing is until people get used to it. PerlIO is pretty new to most of us. Your new_from_scalar class method is just a wrapper and doesn't prevent the newer-style new, so I'm not really objecting. I appreciate that you need to reduce the support load.
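For readers who have not met the PerlIO convention mentioned above, here is a minimal sketch using only core Perl (5.8 or later). The variable names and sample HTML are illustrative assumptions:

```perl
use strict;
use warnings;

my $html = "<p>first</p>\n<p>second</p>\n";

# Opening a reference to a scalar yields an in-memory filehandle,
# so the string can be read as if it were a file.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";

my @lines = <$fh>;
close $fh;

print scalar(@lines), " lines read\n";
```

The same "reference means in-memory data" idea is what new(\$string) relies on.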

    After Compline,
    Zaxo

Re: HTML::TokeParser::Simple advice requested
by Juerd (Abbot) on Aug 13, 2004 at 20:38 UTC

    my $parser1 = HTML::TokeParser::Simple->new($file);
    my $parser2 = HTML::TokeParser::Simple->new($file_handle);

    See also (XML::Parser) Finding and fixing a bug. Please think twice before implementing a hybrid interface. Even if you do know how to code bug-free, do you really need it, or can you add one argument in front that indicates how the second one should be treated?
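A rough sketch of that "one argument in front" idea, assuming a hypothetical helper named source_for and hypothetical kind names (file, handle, string). It only maps an explicit kind to the single argument a file-or-scalar-reference constructor would accept; it is not the module's actual code:

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical table: each kind knows how to prepare its value.
my %SOURCE_FOR = (
    file   => sub { my $path = shift; $path },    # path, passed through
    handle => sub { my $fh   = shift; $fh },      # open filehandle, passed through
    string => sub { my $html = shift; \$html },   # in-memory HTML becomes a scalar ref
);

sub source_for {
    my ($kind, $value) = @_;
    my $handler = $SOURCE_FOR{$kind}
        or croak "Unknown source kind '$kind'";
    return $handler->($value);
}

my $source = source_for(string => '<b>hi</b>');
print ref($source), "\n";
```

Because the caller states the kind, nothing has to be guessed from the shape of the value, and an unknown kind fails loudly.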

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: HTML::TokeParser::Simple advice requested
by Aristotle (Chancellor) on Aug 14, 2004 at 05:46 UTC

    Item 1 is because return_foo is a horrible method name and I, the author, keep forgetting it. Still, this module is popular enough that I worry quite a bit about changing the API, even though I don't plan on removing deprecated methods for a long time.

    Dooo eeeet.

    Seriously. You're not going to upset anyone at all, you'll just get lots of cheering. Go ahead, don't think twice about it.

    new_from_scalar is fine with me as well.

    I don't know how I feel about new_from_fqdn, which is probably supposed to be new_from_url, as FQDN = Fully Qualified Domain Name, which is not nearly the same as a URL. If I'm guessing correctly, that would download the page as well? In that case, I'd prefer new_from_request which takes a HTTP::Request instead. For simple use cases you can just apply HTTP::Request::Common and say

    $p = HTML::TokeParser::Simple->new_from_request(GET 'http://www.perlmonks.org');

    That would allow seamless application to more complex use cases like POST requests. If you really want to, I guess you could DWIM on whether the parameter is an object or a string.

    Makeshifts last the longest.

Re: HTML::TokeParser::Simple advice requested
by tinita (Parson) on Aug 14, 2004 at 09:51 UTC
    Ultimately, I think we'll have the following constructors:
    my $parser1 = HTML::TokeParser::Simple->new($file);          # works
    my $parser2 = HTML::TokeParser::Simple->new($file_handle);
    my $parser3 = HTML::TokeParser::Simple->new_from_scalar($string);
    why not additionally:
    my $parser5 = HTML::TokeParser::Simple->new($string);
    i think XML::Simple does that. it looks at $string and decides if it's xml or a filename. maybe that's possible for your module, too?
    update: just saw that something like that was already suggested.
Re: HTML::TokeParser::Simple advice requested
by iburrell (Chaplain) on Aug 14, 2004 at 00:04 UTC
    What is an FQDN doing there? If it is fetching HTML from an HTTP URL, then it should take a URL, and not assume a) that it is http, b) that you only want the root page, and c) that it is a fully-qualified hostname. How about the poor guys that only have IP addresses, or partially qualified host names?
Re: HTML::TokeParser::Simple advice requested
by PodMaster (Abbot) on Aug 14, 2004 at 10:46 UTC
    I'm reminded of Re: RFC: Wider scope for HTML::TokeParser::Simple. Whatever you decide to do, I want my old scripts to work as-is, without any deprecation warnings. If this does not happen, I will not be using/recommending HTML::TokeParser::Simple anymore (my promise to you :D).

    As for the constructor, forget that new_from stuff. If there is more than one argument ( filename => $foo, scalar => $foo, filehandle => $foo, uri => $foo ), you can intuit that the new interface is wanted and freely enable the deprecation warnings.
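A small sketch of that argument-count test, assuming a hypothetical helper parse_new_args. A single argument is treated as the legacy call and left alone, while a key/value pair marks the new interface, which is where deprecation warnings could then be switched on:

```perl
use strict;
use warnings;

# Hypothetical helper: classify constructor arguments by their count.
sub parse_new_args {
    my @args = @_;

    # Legacy style: exactly one argument, passed through untouched.
    return { style => 'legacy', source => $args[0] } if @args == 1;

    # New style: a key/value pair naming the kind of source.
    my %opt = @args;
    my ($kind) = grep { exists $opt{$_} } qw(filename scalar filehandle uri);
    die "No recognised source given\n" unless defined $kind;
    return { style => 'named', kind => $kind, source => $opt{$kind} };
}

my $old = parse_new_args('index.html');
my $new = parse_new_args(scalar => '<p>hi</p>');
print "$old->{style} $new->{style} ($new->{kind})\n";
```

Old scripts that pass one argument never reach the new-style branch, so they would see no change in behaviour.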

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.