HTML::TokeParser::Simple advice requested

I'm working on HTML::TokeParser::Simple, version 3.0. Changes:

1. Deprecate the return_* methods. $token->return_attr($foo); becomes $token->get_attr($foo) or $token->attr($foo).
2. Add optional (or automatic?) HTML entity encoding/decoding.
3. Internals cleanup.
4. Possibly allow auto-fetching from URLs.

Item 1 is because return_foo is a horrible method name and I, the author, keep forgetting it. Still, this module is popular enough that I worry quite a bit about changing the API, even though I don't plan on removing deprecated methods for a long time.
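One common way to rename methods without breaking existing callers is to keep the old name as a thin alias that warns and then delegates. The following is only a minimal sketch under stated assumptions: `My::Token` is a hypothetical stand-in class that stores attributes in a hash, not the module's real token class or internals.

```perl
use strict;
use warnings;

package My::Token;    # hypothetical stand-in, not the real token class

use Carp ();

sub new {
    my ($class, %attr) = @_;
    return bless { attr => {%attr} }, $class;
}

# New, preferred name.
sub get_attr {
    my ($self, $name) = @_;
    return $self->{attr}{$name};
}

# Old name kept as a deprecated alias: warn from the caller's
# perspective, then delegate to the new method.
sub return_attr {
    my $self = shift;
    Carp::carp('return_attr() is deprecated; use get_attr() instead');
    return $self->get_attr(@_);
}

package main;

my $token = My::Token->new( href => 'http://www.perlmonks.org/' );
print $token->get_attr('href'), "\n";
```

Old code calling return_attr keeps working and gets a warning that points at the calling line, which makes a long deprecation period practical.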

However, the one thing that keeps bugging me is my desire to solve vegasjoe's problem. This is probably the most common question that I field. If you have an HTML document in a file, it's easy to parse:

my $parser = HTML::TokeParser::Simple->new($file);

Of course, that's because I just inherit from HTML::TokeParser and don't worry about what's behind the scenes. However, what I really want to do is make this work:

my $parser = HTML::TokeParser::Simple->new($html_in_a_string);

Currently, people seem to get really confused because it's not intuitive to take a reference to a scalar to parse in-memory HTML. I'll probably do something like:

sub new_from_scalar {
    my ($class, $scalar) = @_;
    return $class->new(\$scalar);
}

Ultimately, I think we'll have the following constructors:

my $parser1 = HTML::TokeParser::Simple->new($file);    # works
my $parser2 = HTML::TokeParser::Simple->new($file_handle);
my $parser3 = HTML::TokeParser::Simple->new_from_scalar($string);
my $parser4 = HTML::TokeParser::Simple->new_from_fqdn($fqdn);

Feedback and advice welcome.

Cheers,
Ovid

New address of my CGI Course.


Replies are listed 'Best First'.

Re: HTML::TokeParser::Simple advice requested
by borisz (Canon) on Aug 13, 2004 at 19:00 UTC

my $parser3 = HTML::TokeParser::Simple->new_from_scalar($string);

my $p = HTML::TokeParser::Simple->new(\$string);

HTML::TokeParser::Simple

$p->get_token

my $p = HTML::TokeParser::Simple->new_from_fqdn($fqdn);

use LWP::Simple;
my $content = get 'http://www.perlmonks.org/';
die "Can't get content" unless defined $content;
HTML::TokeParser::Simple->new(\$content);

HTML::TokeParser::Simple->new({ url => 'http://www.perlmonks.org/' });
HTML::TokeParser::Simple->new({ scalar => $content });

Boris


Re^2: HTML::TokeParser::Simple advice requested

by simonm (Vicar) on Aug 13, 2004 at 21:55 UTC

Or, add a hashref with lots of new options instead of more new-like methods.

++, but there's no need for a hash reference:


my $parser1 = HTML::TokeParser::Simple->new(path => $file_name);
my $parser2 = HTML::TokeParser::Simple->new(handle => $file_handle);
my $parser3 = HTML::TokeParser::Simple->new(string => $string);
my $parser4 = HTML::TokeParser::Simple->new(fqdn => $fqdn);

And is there some reason a bit more auto-sensing couldn't be added to make these be typically implicit? Sure, to be safe, you'd want to use the two-argument form above, but for one-offs you could use the short form.


my $parser1 = HTML::TokeParser::Simple->new($file_name);
my $parser2 = HTML::TokeParser::Simple->new($file_handle);
my $parser3 = HTML::TokeParser::Simple->new($long_string);
my $parser4 = HTML::TokeParser::Simple->new($uri);

  sub new {
    my $class = shift;
    # One argument: guess the mode. Two arguments: (mode => target).
    my ($mode, $target) = (@_ == 1 ? $class->guess_mode($_[0]) : (), @_);
    # Paths, handles and string references go to the parent unchanged.
    my $source = ( $mode eq 'path' )      ? $target :
                 ( $mode eq 'handle' )    ? $target :
                 ( $mode eq 'stringref' ) ? $target :
                 ( $mode eq 'string' )    ? \$target :
                                            do {
                                              my $method = "source_for_$mode";
                                              $class->$method( $target );
                                            };
    $class->SUPER::new( $source );
  }

  sub guess_mode {
    my $class = shift;
    ( ref($_[0]) =~ /^(?:IO|FileHandle)/ ) ? 'handle' :
    ( ref($_[0]) eq 'SCALAR' )             ? 'stringref' :
    ( $_[0] =~ /^\w{3,6}:/ )               ? 'uri' :
    ( length($_[0]) > 1024 )               ? 'string' :
                                             'path';
  }

  sub source_for_uri {
    my ($class, $uri) = @_;
    # ...
  }


Re^3: HTML::TokeParser::Simple advice requested

by borisz (Canon) on Aug 13, 2004 at 22:11 UTC

Or, add a hashref with lots of new options instead of more new-like methods.
++, but there's no need for a hash reference:

my $parser1 = HTML::TokeParser::Simple->new(path => $file_name);
my $parser2 = HTML::TokeParser::Simple->new(handle => $file_handle);
my $parser3 = HTML::TokeParser::Simple->new(string => $string);
my $parser4 = HTML::TokeParser::Simple->new(fqdn => $fqdn);

  sub guess_mode {
    my $class = shift;
    ( ref($_[0]) =~ /^(?:IO|FileHandle)/ ) ? 'handle' :
    ( ref($_[0]) eq 'SCALAR' )             ? 'stringref' :
    ( $_[0] =~ /^\w{3,6}:/ )               ? 'uri' :
    ( length($_[0]) > 1024 )               ? 'string' :
                                             'path';
  }

Please

Boris


Re^3: HTML::TokeParser::Simple advice requested

by ihb (Deacon) on Aug 15, 2004 at 23:05 UTC

For your &guess_mode to be safe, use objects instead,

    my $parser1 = HTML::TokeParser::Simple->new(IO::File::->new($file_name));
    my $parser2 = HTML::TokeParser::Simple->new($file_handle);
    my $parser3 = HTML::TokeParser::Simple->new($long_string);
    my $parser4 = HTML::TokeParser::Simple->new(URI::->new($uri));
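
With objects, the constructor can decide by class instead of inspecting string contents, so any plain string is unambiguously HTML. A rough sketch of that idea, assuming a hypothetical helper named mode_for (not part of the module); the URI check relies only on the class name, so the URI module itself is not loaded here:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use IO::File;

# Hypothetical helper: pick a mode from the argument's type,
# never from what a plain string happens to look like.
sub mode_for {
    my ($source) = @_;
    if ( blessed($source) ) {
        return 'handle' if $source->isa('IO::Handle');
        return 'uri'    if $source->isa('URI');
    }
    return 'handle' if ref($source) eq 'GLOB';    # old-style \*FH handles
    return 'string';    # plain scalars are always treated as HTML text
}

print mode_for( IO::File->new ),  "\n";
print mode_for('<p>hello</p>'),   "\n";
```

The cost is that callers must wrap file names and URLs themselves, but a short HTML string can no longer be mistaken for a path.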

ihb

Read argumentation in its context!


Re: HTML::TokeParser::Simple advice requested
by Zaxo (Archbishop) on Aug 13, 2004 at 22:37 UTC

I like that you've followed the PerlIO convention of taking a reference to a scalar as the marker for an in-memory file. That is an admirably concise notation for something that was previously difficult to do.

You say it's not intuitive, but nothing is until people get used to it. PerlIO is pretty new to most of us. Your new_from_scalar class method is just a wrapper and doesn't prevent the newer-style new, so I'm not really objecting. I appreciate that you need to reduce the support load.
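
For readers who have not met the PerlIO convention mentioned here, this is roughly what it looks like in core Perl (5.8 and later): passing a reference to a scalar to open gives a handle over the string's contents. The sample HTML is made up for illustration.

```perl
use strict;
use warnings;

my $html = "<p>first line</p>\n<p>second line</p>\n";

# Opening a reference to a scalar yields a read handle over the
# string's contents; no temporary file is needed.
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";

my @lines = <$fh>;
close $fh;

print scalar(@lines), " lines read\n";
```

The constructor call new(\$string) uses the same marker, a scalar reference, to mean "this is the data itself, not a file name".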

After Compline,
Zaxo


Re: HTML::TokeParser::Simple advice requested
by Juerd (Abbot) on Aug 13, 2004 at 20:38 UTC

my $parser1 = HTML::TokeParser::Simple->new($file);
my $parser2 = HTML::TokeParser::Simple->new($file_handle);

See also (XML::Parser) Finding and fixing a bug. Please think twice before implementing a hybrid interface. Even if you do know how to code bug-free, do you really need it, or can you add one argument in front that indicates how the second one should be treated?

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }


Re^2: HTML::TokeParser::Simple advice requested

by Ovid (Cardinal) on Aug 13, 2004 at 20:48 UTC

Actually, the particular example you cite is implemented in HTML::TokeParser. As I'm inheriting, I get that for free. I should think twice before violating the Liskov Substitution Principle :)

Cheers,
Ovid

New address of my CGI Course.


Re: HTML::TokeParser::Simple advice requested
by Aristotle (Chancellor) on Aug 14, 2004 at 05:46 UTC

Item 1 is because return_foo is a horrible method name and I, the author, keep forgetting it. Still, this module is popular enough that I worry quite a bit about changing the API, even though I don't plan on removing deprecated methods for a long time.

Dooo eeeet.

Seriously. You're not going to upset anyone at all, you'll just get lots of cheering. Go ahead, don't think twice about it.

new_from_scalar is fine with me as well.

I don't know how I feel about new_from_fqdn, which is probably supposed to be new_from_url, as FQDN = Fully Qualified Domain Name, which is not nearly the same as a URL. If I'm guessing correctly, that would download the page as well? In that case, I'd prefer new_from_request, which takes an HTTP::Request instead. For simple use cases you can just apply HTTP::Request::Common and say

$p = HTML::TokeParser::Simple->new_from_request( GET 'http://www.perlmonks.org' );

That would allow seamless application to more complex use cases like POST requests. If you really want to, I guess you could DWIM on whether the parameter is an object or a string.

Makeshifts last the longest.


Re: HTML::TokeParser::Simple advice requested
by tinita (Parson) on Aug 14, 2004 at 09:51 UTC

Ultimately, I think we'll have the following constructors:
my $parser1 = HTML::TokeParser::Simple->new($file);    # works
my $parser2 = HTML::TokeParser::Simple->new($file_handle);
my $parser3 = HTML::TokeParser::Simple->new_from_scalar($string);

my $parser5 = HTML::TokeParser::Simple->new($string);


Re: HTML::TokeParser::Simple advice requested
by iburrell (Chaplain) on Aug 14, 2004 at 00:04 UTC

What is an FQDN doing there? If it is fetching HTML from an HTTP URL, then it should take a URL. It should not assume a) that the scheme is http, b) that you only want the root page, and c) that the host is a fully-qualified hostname. How about the poor guys that only have IP addresses, or partially qualified host names?


Re: HTML::TokeParser::Simple advice requested
by PodMaster (Abbot) on Aug 14, 2004 at 10:46 UTC

Re: RFC: Wider scope for HTML::TokeParser::Simple

As for the constructor, forget that new_from stuff. If there is more than one argument ( filename => $foo, scalar => $foo, filehandle => $foo, uri => $foo ), you can intuit that the new interface is wanted and freely enable the deprecation warnings.
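
The argument-count check could be sketched as below. This is an illustration only: classify_args and the %KNOWN_MODE table are hypothetical names, and the accepted keys are simply the four listed above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (style, mode, target).
# One argument   => legacy call, passed through untouched.
# Key/value pair => new interface; the key names the source type.
my %KNOWN_MODE = map { $_ => 1 } qw(filename scalar filehandle uri);

sub classify_args {
    my @args = @_;
    return ( 'legacy', undef, $args[0] ) if @args == 1;

    die "Expected a single source or one key/value pair\n" unless @args == 2;
    my ( $mode, $target ) = @args;
    die "Unknown source type '$mode'\n" unless $KNOWN_MODE{$mode};
    return ( 'named', $mode, $target );
}

my ( $style, $mode ) = classify_args( scalar => '<p>hi</p>' );
print "$style/$mode\n";
```

A constructor built on this could leave legacy callers alone and switch on deprecation warnings only when the style comes back as 'named'.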

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.
