in reply to (Ovid) Re: Script (ovid's) explanation
in thread Script (ovid's) explanation

Ovid: is your alternative interface to HTML::TokeParser in CPAN?

..Guv

  • Comment on Re: (Ovid) Re: Script (ovid's) explanation

Replies are listed 'Best First'.
(Ovid) Re(3 or 4): Script (ovid's) explanation
by Ovid (Cardinal) on Feb 27, 2002 at 00:33 UTC

    No. It has some limitations that doesn't make it a general replacement (I can't recall what those limitations are, though). However, for most small projects, it should be fine. I've rummaged around and found a version (not sure if this is the one I was intending. If you want to play around, here it is. Full POD with code samples is included.

    ################## package HTML::TokeParser::Easy; ################## use strict; use Carp; use HTML::TokeParser; use vars qw/ @ISA $VERSION $AUTOLOAD /; $VERSION = '1.0'; @ISA = qw/ HTML::TokeParser /; use constant START_TAG => 'S'; use constant END_TAG => 'E'; use constant TEXT => 'T'; use constant COMMENT => 'C'; use constant DECLARATION => 'D'; use constant PROCESS_INSTRUCTION => 'PI'; my %token_spec = ( S => { _name => 'START_TAG', tag => 1, attr => 2, attrseq => 3, text => 4 }, E => { _name => 'END_TAG', tag => 1, text => 2 }, T => { _name => 'TEXT', text => 1 }, C => { _name => 'COMMENT', text => 1 }, D => { _name => 'DECLARATION', text => 1 }, PI => { _name => 'PROCESS_INSTRUCTION', token0 => 1, text => 2 } ); sub AUTOLOAD { no strict 'refs'; my ($self, $token) = @_; # was it an is_... method? if ( $AUTOLOAD =~ /.*::is_(\w+)/ ) { my $token_type = uc $1; my $tag = &$token_type; *{ $AUTOLOAD } = sub { return $_[ 1 ]->[ 0 ] eq $tag ? 1 : 0 } +; return &$AUTOLOAD; } elsif ( $AUTOLOAD =~ /.*::return_(\w+)/ ) { # was it a return_... method? my $token_attr = $1; *{ $AUTOLOAD } = sub { my $attr = $_[ 1 ]->[ 0 ]; if ( exists $token_spec{ $attr } and exists $token_spe +c{ $attr }{ $token_attr } ) { return $_[ 1 ]->[ $token_spec{ $attr }{ $token_att +r } ]; } else { if ( ! exists $token_spec{ $attr } ) { carp "No such token: '$attr'"; } else { carp "No such attribute: '$token_attr' for $to +ken_spec{ $attr }{ _name }"; } } }; return &$AUTOLOAD; } else { # Yo! You can't do that! croak "No such method: $AUTOLOAD"; } } sub DESTROY {}; __END__ =head1 NAME HTML::TokeParser::Easy - simple method access to TokeParser tokens (no + more memorizing array indices). =head1 SYNOPSIS use HTML::TokeParser::Easy; my $p = HTML::TokeParser::Easy->new( $somefile ); while ( my $token = $parser->get_token ) { # This prints all text in an HTML doc (i.e., it strips the HTML) next if ! $parser->is_text( $token ); print $parser->return_text( $token ); } =head1 DESCRIPTION C<HTML::TokeParser> is a fairly common method of parsing HTML. Howeve +r, the tokens returned are not exactly intuitive to parse: ["S", $tag, $attr, $attrseq, $text] ["E", $tag, $text] ["T", $text, $is_data] ["C", $text] ["D", $text] ["PI", $token0, $text] To simplify this, C<HTML::TokeParser::Easy> allows the user ask more i +ntuitive (read: more self-documenting) questions about the tokens returned. Specifically, +there are 6 C<is_foo> type methods and 6 C<return_bar> type methods. The C<is_> methods all +ow you to determine the token type and the C<return_> methods get the data that you need. Since this is a subclass of C<HTML::TokeParser>, all C<HTML::TokeParse +r> methods are available. To truly appreciate the power of this module, please read the document +ation for C<HTML::TokeParser> and C<HTML::Parser>. The following will be brief descriptions of the available methods foll +owed by examples. =head1 C<is_> Methods =head2 Note: Due to the way that C<AUTOLOAD> has been coded, the portion of the C<i +s_> methods after the C<is_> part is case-insensitive. For example, the following lines of +code are identical: $parser->is_start_tag( $token ); $parser->is_START_TAG( $token ); $parser->is_stArt_tAg( $token ); Yes, I could have done something about that, but why bother? =over 4 =item 1 C<is_start_tag> Use this to determine if you have a start tag. =item 2 C<is_end_tag.> Use this to determine if you have an end tag. =item 3 C<is_text> Use this to determine if you have text. Note that this is I<not> to b +e confused with the C<return_text> method described below! C<is_text> will identify text +that the user typically sees display in the Web browser. =item 4 C<is_comment> Are you still reading this? Nobody reads POD. Don't you know you're +supposed to go to CLPM, ask a question that's answered in the POD and get flamed? It's a rite + of passage. Really. C<is_comment> is used to identify comments. See the HTML::Parser docu +mentation for more information about comments. There's more than you might think. =item 5 C<is_declaration> This will match the DTD at the top of your HTML. (You I<do> use DTD's, + don't you?) =item 6 C<is_process_instruction> Process Instructions are from XML. This is very handy if you need to +parse out PHP and similar things with a parser. =back =head1 The 6 C<return_> methods =head2 Note: As noted for the 6 C<is_> methods, these methods are case-insensitive +after the C<return_> part. =over 4 =item 1 C<return_tag> Do you have a start tag or end tag? This will return the type (lower +case). =item 2 C<return_attr> If you have a start tag, this will return a hash ref with the attribut +e names as keys and the values as the values. =item 3 C<return_attrseq> For a start tag, this is an array reference with the sequence of the a +ttributes, if any. =item 4 C<return_text> This is the exact text of whatever the token is representing. =item 5 C<return_is_data> This text is in a CDATA section. =item 6 C<return_token0> For processing instructions, this will return the token found immediat +ely after the opening tag. For \<?php, "php" will be returned. =back =head1 Examples =head2 Finding comments For some strange reason, your Pointy-Haired Boss (PHB) is convinced th +at the graphics department is making fun of him by embedding rude things about him in HTML commen +ts. You need to get all HTML comments from the HTML. use strict; use HTML::TokeParser::Easy; my @html_docs = glob( "*.html" ); open PHB, "> phbreport.txt" or die "Cannot open phbreport for writing +: $!"; foreach my $doc ( @html_docs ) { print "Processing $doc\n"; my $p = HTML::TokeParser::Easy->new( $doc ); while ( my $token = $p->get_token ) { next if ! $p->is_comment( $token ); print PHB $p->return_text( $token ), "\n"; } } close PHB; =head2 Stripping Comments Uh oh. Turns out that your PHB was right for a change. Many of the c +omments in the HTML weren't very polite. Since your entire graphics department was just fired, it + falls on you need to strip those comments from the HTML. use strict; use HTML::TokeParser::Easy; my $new_folder = 'no_comment/'; my @html_docs = glob( "*.html" ); foreach my $doc ( @html_docs ) { print "Processing $doc\n"; my $new_file = "$new_folder$doc"; open PHB, "> $new_file" or die "Cannot open $new_file for writing +: $!"; my $p = HTML::TokeParser::Easy->new( $doc ); while ( my $token = $p->get_token ) { next if $p->is_comment( $token ); print PHB $p->return_text( $token ); } close PHB; } =head2 Changing form tags Your company was foo.com and now is bar.com. Unfortunately, whoever w +rote your HTML decided to hardcode "http://www.foo.com/" into the C<action> attribute of the for +m tags. You need to change it to "http://www.bar.com/". use strict; use HTML::TokeParser::Easy; my $new_folder = 'new_html/'; my @html_docs = glob( "*.html" ); foreach my $doc ( @html_docs ) { print "Processing $doc\n"; my $new_file = "$new_folder$doc"; open FILE, "> $new_file" or die "Cannot open $new_file for writin +g: $!"; my $p = HTML::TokeParser::Easy->new( $doc ); while ( my $token = $p->get_token ) { if ( $p->is_start_tag( $token ) and $p->return_tag( $token ) +eq 'form' ) { my $form_tag = new_form_tag( $p->return_attr( $token ), $ +p->return_attrseq( $token ) ); print FILE $form_tag; } else { print FILE $p->return_text( $token ); } } close FILE; } sub new_form_tag { my ( $attr, $attrseq ) = @_; if ( exists $attr->{ action } ) { $attr->{ action } =~ s/www\.foo\.com/www.bar.com/; } my $tag = ''; foreach ( @$attrseq ) { $tag .= "$_=\"$attr->{ $_ }\" "; } $tag = "<form $tag>"; return $tag; } =head1 COPYRIGHT Copyright (c) 2001 Curtis "Ovid" Poe. All rights reserved. This program is free software; you may redistribute it and/or modify i +t under the same terms as Perl itself =head1 AUTHOR Curtis "Ovid" Poe L<poec@yahoo.com> =head1 BUGS 2001/10/04 There are no known bugs at this time. Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in + incorrect behavior as older versions do not always handle XHTML correctly. It is the programmer's + responsibility to verify that the behavior of this code matches the programmer's needs. Address bug reports and comments to: L<poec@yahoo.com>. When sending +bug reports, please provide the version of HTML::Parser, HTML::TokeParser, HTML::To +keParser::Easy, the version of Perl, and the version of the operating system you are u +sing. =head1 BUGS 2001/10/04 There are no known bugs at this time. Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in + incorrect behavior as older versions do not always handle XHTML correctly. It is the programmer's + responsibility to verify that the behavior of this code matches the programmer's needs. =cut

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.