Ovid has asked for the wisdom of the Perl Monks concerning the following question:

I'm playing around with different methods of subclassing to get a feel for user interface issues and I have a question of "style".

Right now, I'm subclassing HTML::TokeParser as HTML::TokeParser::Easy and I thought it would be interesting to be able to do the following:

    my $parser = HTML::TokeParser::Easy->new( $some_html );
    while ( my $token = $parser->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next if ! $token->is_text;
        print $token->return_text;
    }

Unfortunately, the only way I can think to do that is by blessing the $token and turning it into an object. That seems like it would involve a lot of overhead.

    sub get_token {
        my $self  = shift;
        my $class = ref $self;
        my $token = $self->SUPER::get_token;
        return undef if ! defined $token;
        bless $token, $class;
    }
    # create appropriate methods...

The other strategy I thought of was to allow the user to do the following:

    while ( my $token = $parser->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next if ! $parser->is_text( $token );
        print $parser->return_text( $token );
    }

I used the following to do this (using AUTOLOAD to simplify things):

    ##################
    package HTML::TokeParser::Easy;
    ##################
    use strict;
    use HTML::TokeParser;
    use vars qw/ @ISA $VERSION $AUTOLOAD /;
    $VERSION = '1.0';
    @ISA = qw/ HTML::TokeParser /;

    use constant START_TAG   => 'S';
    use constant END_TAG     => 'E';
    use constant TEXT        => 'T';
    use constant COMMENT     => 'C';
    use constant DECLARATION => 'D';

    my %token_spec = (
        S => { _name => 'START_TAG',   tag => 1, attr => 2, attrseq => 3, text => 4 },
        E => { _name => 'END_TAG',     tag => 1, text => 2 },
        T => { _name => 'TEXT',        text => 1 },
        C => { _name => 'COMMENT',     text => 1 },
        D => { _name => 'DECLARATION', text => 1 }
    );

    sub AUTOLOAD {
        no strict 'refs';
        my ($self, $token) = @_;

        # was it an is_... method?
        if ( $AUTOLOAD =~ /.*::is_(\w+)/ ) {
            my $token_type = uc $1;
            my $tag = &$token_type;
            *{ $AUTOLOAD } = sub { return $_[ 1 ]->[ 0 ] eq $tag ? 1 : 0 };
            return &$AUTOLOAD;
        }
        elsif ( $AUTOLOAD =~ /.*::return_(\w+)/ ) { # was it a return_... method?
            my $token_attr = $1;
            *{ $AUTOLOAD } = sub {
                my $attr = $_[ 1 ]->[ 0 ];
                if ( exists $token_spec{ $attr }{ $token_attr } ) {
                    return $_[ 1 ]->[ $token_spec{ $attr }{ $token_attr } ];
                }
                else {
                    warn "No such attribute: '$token_attr' for $token_spec{ $attr }{ _name }";
                }
            };
            return &$AUTOLOAD;
        }
        else {
            # Yo! You can't do that!
            die "No such method: $AUTOLOAD";
        }
    }

Blessing the tokens makes the interface seem much more intuitive, but creating so many objects seems like it's going to be wasteful and slow. The second method works fine, but the interface seems a bit cumbersome. Is there any way I can get the syntax of the first method without the overhead?

Cheers,
Ovid


Replies are listed 'Best First'.
Re: Subclassing strategies
by chromatic (Archbishop) on Oct 01, 2001 at 10:33 UTC
    It doesn't seem like a lot of overhead to me. You have one additional method call, a ref, and a bless. You'll have to pay allocation penalties (such as they are) for any token you create anyway, and you'll have method calls (the real performance killer) anyway as well.

    If you have much more than a dozen additional ops, I'd be very surprised. (Though you could speed up the first, uncached method dispatch by automagically importing is_text() and return_text() into your subclass.)
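One reading of the suggestion above is to install the is_... methods up front instead of waiting for AUTOLOAD to create them on first use. The following is only a sketch under that assumption: the package name My::Easy and the %type_tag table are hypothetical stand-ins for HTML::TokeParser::Easy and its constants, and the tokens are hand-written arrays in the layout HTML::TokeParser documents.

```perl
use strict;
use warnings;

# Hypothetical package name, standing in for HTML::TokeParser::Easy.
package My::Easy;

# Method-name suffix => first element of the token array.
my %type_tag = (
    start_tag   => 'S',
    end_tag     => 'E',
    text        => 'T',
    comment     => 'C',
    declaration => 'D',
);

# Install is_start_tag(), is_text(), etc. once, when the file is loaded.
for my $name ( keys %type_tag ) {
    my $tag = $type_tag{$name};
    no strict 'refs';
    *{ __PACKAGE__ . "::is_$name" } = sub { $_[1][0] eq $tag ? 1 : 0 };
}

package main;

my $parser = bless {}, 'My::Easy';
print $parser->is_text( [ 'T', 'hi', 0 ] ), "\n";        # prints 1
print $parser->is_text( [ 'E', 'p', '</p>' ] ), "\n";    # prints 0
```

Because the methods already exist when the first token arrives, no call has to fall through to AUTOLOAD.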

    I'd pay the price for convenience. It'd take a huge document for you to notice the cost, in my opinion.
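The cost can be measured rather than guessed. Here is a rough sketch using the core Benchmark module; the My::Token class and the two helper subs are made up for the measurement and are not part of HTML::TokeParser. It compares building a plain array token and testing its type inline against blessing the same array and calling a method on it. The numbers will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical token class, used only for this measurement.
package My::Token;
sub is_text { $_[0][0] eq 'T' }

package main;

# Same array layout in both cases; only the bless differs.
sub plain_token   { return [ 'T', 'some text', 0 ] }
sub blessed_token { return bless [ 'T', 'some text', 0 ], 'My::Token' }

# Run each variant 500,000 times without printing the raw timings.
my $results = timethese( 500_000, {
    plain   => sub { my $t = plain_token();   my $ok = $t->[0] eq 'T' },
    blessed => sub { my $t = blessed_token(); my $ok = $t->is_text },
}, 'none' );

# Print a comparison table of the two rates.
cmpthese($results);
```

A real parse also spends time tokenizing the HTML, so the relative difference seen here is likely to be an upper bound on what a whole program would notice.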

Re: Subclassing strategies
by BrentDax (Hermit) on Oct 01, 2001 at 12:06 UTC
    <OPTIMISM>
    This is the sort of thing Perl 6 will have properties for. In HTML::TokeParser::Easy's get_token method, you would just mark the return value with 'is tag'. Then, your loop would just be:
    while($toke=$parser.get_token) { next if $toke.tag; print $toke }
    Token objects need not apply. Much cleaner, don't you think?
    </OPTIMISM>

    =cut
    --Brent Dax
    There is no sig.

Re: Subclassing strategies
by Zaxo (Archbishop) on Oct 01, 2001 at 15:29 UTC

    I don't think subclassing HTML::TokeParser for the token is a good fit. It appears to violate the 'is-a' rule for inheritance: a token is not a parser. Perhaps the parser object should be regarded as a queue of tokens, or maybe a token factory.

    A look at the HTML::Parser<--HTML::PullParser<--HTML::TokeParser namespace suggests that HTML::Token might be the level for the Token object.
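A minimal sketch of that separation, assuming a token class named HTML::Token (the name comes from the suggestion above; the class body here is invented for illustration). The tokens are hand-written arrays in the layout HTML::TokeParser documents, so the example runs without the parser installed. The bless inside the loop is what an overridden get_token would do before returning.

```perl
use strict;
use warnings;

# Hypothetical token class; it knows about tokens and nothing about parsing.
package HTML::Token;

# Index of the raw text within each token type's array.
my %text_index = ( S => 4, E => 2, T => 1, C => 1, D => 1 );

sub is_text      { $_[0][0] eq 'T' }
sub is_start_tag { $_[0][0] eq 'S' }

sub return_text {
    my $self = shift;
    return $self->[ $text_index{ $self->[0] } ];
}

package main;

# Hand-written tokens in the array layout HTML::TokeParser documents.
my @raw = (
    [ 'S', 'h1', {}, [], '<h1>' ],
    [ 'T', 'Hello World', 0 ],
    [ 'E', 'h1', '</h1>' ],
);

my $stripped = '';
for my $raw (@raw) {
    # The parser acts as a factory: it blesses into the token class,
    # not into its own class.
    my $token = bless $raw, 'HTML::Token';
    next if !$token->is_text;
    $stripped .= $token->return_text;
}
print "$stripped\n";    # prints "Hello World"
```

With this split the parser subclass only needs to override get_token, and the is_.../return_... methods live on the thing they describe.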

    After Compline,
    Zaxo

Re: Subclassing strategies
by tachyon (Chancellor) on Oct 01, 2001 at 17:12 UTC

    Of course, just using the HTML::Parser class directly, you could do this:

    package TextStrip;
    use base 'HTML::Parser';

    sub text { $strip_text .= $_[1] }

    my $parser = new TextStrip;
    $parser->parse_file(\*DATA) && print $strip_text;

    __DATA__
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <title>Index</title>
    </head>
    <body>
    <h1>Hello World</h1>
    <p>Just Another
    <p>Parser Hack
    </body>
    </html>

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Subclassing strategies
by theorbtwo (Prior) on Oct 02, 2001 at 03:23 UTC
    chromatic's probably right. That being said, I wonder if you could just return a hash, possibly with overloaded stringification. (If you have to overload, though, it's probably worse than just creating a new object.)
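For reference, a small sketch of what a hash with overloaded stringification could look like. The class name My::TextToken and its type/text keys are hypothetical. Note that the core overload pragma only applies to blessed references, so this approach still pays for a bless per token.

```perl
use strict;
use warnings;

# Hypothetical class name, for illustration only.
package My::TextToken;

# Stringify to the token's text; let Perl derive other operators from that.
use overload
    '""'     => sub { $_[0]{text} },
    fallback => 1;

sub new {
    my ( $class, %args ) = @_;
    return bless { %args }, $class;
}

package main;

my $token = My::TextToken->new( type => 'T', text => 'Hello World' );

print "$token\n";                                    # prints "Hello World"
print "is text\n" if $token->{type} eq 'T';          # hash access still works
```

The token can be printed directly while the type stays available as an ordinary hash key.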
    Thanks,
    James Mastros,
    Just Another Perl Initate