No. It has some limitations that doesn't make it a general replacement (I can't recall what those limitations are, though). However, for most small projects, it should be fine. I've rummaged around and found a version (not sure if this is the one I was intending. If you want to play around, here it is. Full POD with code samples is included.
##################
package HTML::TokeParser::Easy;
##################
use strict;
use Carp;
use HTML::TokeParser;
use vars qw/ @ISA $VERSION $AUTOLOAD /;
$VERSION = '1.0';
@ISA = qw/ HTML::TokeParser /;
use constant START_TAG => 'S';
use constant END_TAG => 'E';
use constant TEXT => 'T';
use constant COMMENT => 'C';
use constant DECLARATION => 'D';
use constant PROCESS_INSTRUCTION => 'PI';
my %token_spec = (
S => {
_name => 'START_TAG',
tag => 1,
attr => 2,
attrseq => 3,
text => 4
},
E => {
_name => 'END_TAG',
tag => 1,
text => 2
},
T => {
_name => 'TEXT',
text => 1
},
C => {
_name => 'COMMENT',
text => 1
},
D => {
_name => 'DECLARATION',
text => 1
},
PI => {
_name => 'PROCESS_INSTRUCTION',
token0 => 1,
text => 2
} );
sub AUTOLOAD {
no strict 'refs';
my ($self, $token) = @_;
# was it an is_... method?
if ( $AUTOLOAD =~ /.*::is_(\w+)/ )
{
my $token_type = uc $1;
my $tag = &$token_type;
*{ $AUTOLOAD } = sub { return $_[ 1 ]->[ 0 ] eq $tag ? 1 : 0 }
+;
return &$AUTOLOAD;
}
elsif ( $AUTOLOAD =~ /.*::return_(\w+)/ )
{
# was it a return_... method?
my $token_attr = $1;
*{ $AUTOLOAD } =
sub
{
my $attr = $_[ 1 ]->[ 0 ];
if ( exists $token_spec{ $attr } and exists $token_spe
+c{ $attr }{ $token_attr } )
{
return $_[ 1 ]->[ $token_spec{ $attr }{ $token_att
+r } ];
}
else
{
if ( ! exists $token_spec{ $attr } )
{
carp "No such token: '$attr'";
} else {
carp "No such attribute: '$token_attr' for $to
+ken_spec{ $attr }{ _name }";
}
}
};
return &$AUTOLOAD;
}
else
{
# Yo! You can't do that!
croak "No such method: $AUTOLOAD";
}
}
sub DESTROY {};
__END__
=head1 NAME
HTML::TokeParser::Easy - simple method access to TokeParser tokens (no
+ more memorizing array
indices).
=head1 SYNOPSIS
use HTML::TokeParser::Easy;
my $p = HTML::TokeParser::Easy->new( $somefile );
while ( my $token = $parser->get_token )
{
# This prints all text in an HTML doc (i.e., it strips the HTML)
next if ! $parser->is_text( $token );
print $parser->return_text( $token );
}
=head1 DESCRIPTION
C<HTML::TokeParser> is a fairly common method of parsing HTML. Howeve
+r, the tokens returned
are not exactly intuitive to parse:
["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]
To simplify this, C<HTML::TokeParser::Easy> allows the user ask more i
+ntuitive (read: more
self-documenting) questions about the tokens returned. Specifically,
+there are 6 C<is_foo>
type methods and 6 C<return_bar> type methods. The C<is_> methods all
+ow you to determine
the token type and the C<return_> methods get the data that you need.
Since this is a subclass of C<HTML::TokeParser>, all C<HTML::TokeParse
+r> methods are available.
To truly appreciate the power of this module, please read the document
+ation for C<HTML::TokeParser>
and C<HTML::Parser>.
The following will be brief descriptions of the available methods foll
+owed by examples.
=head1 C<is_> Methods
=head2 Note:
Due to the way that C<AUTOLOAD> has been coded, the portion of the C<i
+s_> methods after the
C<is_> part is case-insensitive. For example, the following lines of
+code are identical:
$parser->is_start_tag( $token );
$parser->is_START_TAG( $token );
$parser->is_stArt_tAg( $token );
Yes, I could have done something about that, but why bother?
=over 4
=item 1 C<is_start_tag>
Use this to determine if you have a start tag.
=item 2 C<is_end_tag.>
Use this to determine if you have an end tag.
=item 3 C<is_text>
Use this to determine if you have text. Note that this is I<not> to b
+e confused with the
C<return_text> method described below! C<is_text> will identify text
+that the user typically
sees display in the Web browser.
=item 4 C<is_comment>
Are you still reading this? Nobody reads POD. Don't you know you're
+supposed to go to CLPM,
ask a question that's answered in the POD and get flamed? It's a rite
+ of passage.
Really.
C<is_comment> is used to identify comments. See the HTML::Parser docu
+mentation for more
information about comments. There's more than you might think.
=item 5 C<is_declaration>
This will match the DTD at the top of your HTML. (You I<do> use DTD's,
+ don't you?)
=item 6 C<is_process_instruction>
Process Instructions are from XML. This is very handy if you need to
+parse out PHP and similar
things with a parser.
=back
=head1 The 6 C<return_> methods
=head2 Note:
As noted for the 6 C<is_> methods, these methods are case-insensitive
+after the
C<return_> part.
=over 4
=item 1 C<return_tag>
Do you have a start tag or end tag? This will return the type (lower
+case).
=item 2 C<return_attr>
If you have a start tag, this will return a hash ref with the attribut
+e names as keys
and the values as the values.
=item 3 C<return_attrseq>
For a start tag, this is an array reference with the sequence of the a
+ttributes, if any.
=item 4 C<return_text>
This is the exact text of whatever the token is representing.
=item 5 C<return_is_data>
This text is in a CDATA section.
=item 6 C<return_token0>
For processing instructions, this will return the token found immediat
+ely after the opening
tag. For \<?php, "php" will be returned.
=back
=head1 Examples
=head2 Finding comments
For some strange reason, your Pointy-Haired Boss (PHB) is convinced th
+at the graphics department
is making fun of him by embedding rude things about him in HTML commen
+ts. You need to get
all HTML comments from the HTML.
use strict;
use HTML::TokeParser::Easy;
my @html_docs = glob( "*.html" );
open PHB, "> phbreport.txt" or die "Cannot open phbreport for writing
+: $!";
foreach my $doc ( @html_docs )
{
print "Processing $doc\n";
my $p = HTML::TokeParser::Easy->new( $doc );
while ( my $token = $p->get_token )
{
next if ! $p->is_comment( $token );
print PHB $p->return_text( $token ), "\n";
}
}
close PHB;
=head2 Stripping Comments
Uh oh. Turns out that your PHB was right for a change. Many of the c
+omments in the HTML weren't
very polite. Since your entire graphics department was just fired, it
+ falls on you need to strip
those comments from the HTML.
use strict;
use HTML::TokeParser::Easy;
my $new_folder = 'no_comment/';
my @html_docs = glob( "*.html" );
foreach my $doc ( @html_docs )
{
print "Processing $doc\n";
my $new_file = "$new_folder$doc";
open PHB, "> $new_file" or die "Cannot open $new_file for writing
+: $!";
my $p = HTML::TokeParser::Easy->new( $doc );
while ( my $token = $p->get_token )
{
next if $p->is_comment( $token );
print PHB $p->return_text( $token );
}
close PHB;
}
=head2 Changing form tags
Your company was foo.com and now is bar.com. Unfortunately, whoever w
+rote your HTML decided to
hardcode "http://www.foo.com/" into the C<action> attribute of the for
+m tags. You need to change
it to "http://www.bar.com/".
use strict;
use HTML::TokeParser::Easy;
my $new_folder = 'new_html/';
my @html_docs = glob( "*.html" );
foreach my $doc ( @html_docs )
{
print "Processing $doc\n";
my $new_file = "$new_folder$doc";
open FILE, "> $new_file" or die "Cannot open $new_file for writin
+g: $!";
my $p = HTML::TokeParser::Easy->new( $doc );
while ( my $token = $p->get_token )
{
if ( $p->is_start_tag( $token ) and $p->return_tag( $token )
+eq 'form' )
{
my $form_tag = new_form_tag( $p->return_attr( $token ), $
+p->return_attrseq( $token ) );
print FILE $form_tag;
}
else
{
print FILE $p->return_text( $token );
}
}
close FILE;
}
sub new_form_tag {
my ( $attr, $attrseq ) = @_;
if ( exists $attr->{ action } )
{
$attr->{ action } =~ s/www\.foo\.com/www.bar.com/;
}
my $tag = '';
foreach ( @$attrseq )
{
$tag .= "$_=\"$attr->{ $_ }\" ";
}
$tag = "<form $tag>";
return $tag;
}
=head1 COPYRIGHT
Copyright (c) 2001 Curtis "Ovid" Poe. All rights reserved.
This program is free software; you may redistribute it and/or modify i
+t under
the same terms as Perl itself
=head1 AUTHOR
Curtis "Ovid" Poe L<poec@yahoo.com>
=head1 BUGS
2001/10/04 There are no known bugs at this time.
Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in
+ incorrect behavior as older
versions do not always handle XHTML correctly. It is the programmer's
+ responsibility to verify
that the behavior of this code matches the programmer's needs.
Address bug reports and comments to: L<poec@yahoo.com>. When sending
+bug reports,
please provide the version of HTML::Parser, HTML::TokeParser, HTML::To
+keParser::Easy,
the version of Perl, and the version of the operating system you are u
+sing.
=head1 BUGS
2001/10/04 There are no known bugs at this time.
Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in
+ incorrect behavior as older
versions do not always handle XHTML correctly. It is the programmer's
+ responsibility to verify
that the behavior of this code matches the programmer's needs.
=cut
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats. |