I've had success skipping script and CSS data within web pages using HTML::TokeParser, as follows:
use strict;
use warnings;
use HTML::TokeParser;

# this assumes an html file in @ARGV or on STDIN:
my $src;
{   # read the entire HTML input stream as one contiguous string:
    local $/ = undef;
    $src = <>;
}
my $htm = HTML::TokeParser->new( \$src );
my $inscript = 0;   # nesting depth inside ignored elements
my $ignore = join '|', qw/script style cssheader/;
while ( my $tkn = $htm->get_token )
{
    if ( $$tkn[0] eq 'S' and $$tkn[1] =~ /^(?:$ignore)$/ )
    {
        $inscript++;    # skip anything having to do with scripts, styles or css
        next;
    }
    elsif ( $$tkn[0] eq 'E' and $$tkn[1] =~ /^(?:$ignore)$/ )
    {
        $inscript--;
        next;
    }
    elsif ( $$tkn[0] eq 'T' and ! $inscript ) {
        # we have text that is not part of scripting or styling,
        # so do something with this text...
    }
}
This assumes the HTML input is well-formed with respect to script, style and cssheader tags. Note that HTML::TokeParser isn't really any more complicated than HTML::TokeParser::Simple -- you just have to know the structure of the tokens it returns, so that you can set up handlers for the different types: start tags are flagged by $$tkn[0] eq 'S', end tags by 'E', text data by 'T', etc., with the tag name or text content stored in $$tkn[1].
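To make the token structure concrete, here is a minimal sketch that wraps the same loop in a subroutine and runs it on an inline sample. The name visible_text, the sample HTML, and the choice to ignore only script and style are my own assumptions for illustration, not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): collect the text
# tokens of $html that lie outside <script> and <style> elements.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html );
    my %skip   = map { $_ => 1 } qw/script style/;
    my $depth  = 0;    # how many ignored elements we are currently inside
    my @chunks;
    while ( my $tkn = $parser->get_token ) {
        # Element 0 is the token type; element 1 is the tag name
        # (for 'S' and 'E' tokens) or the text itself (for 'T' tokens).
        my ( $type, $value ) = @$tkn;
        if    ( $type eq 'S' and $skip{$value} ) { $depth++ }
        elsif ( $type eq 'E' and $skip{$value} ) { $depth-- if $depth }
        elsif ( $type eq 'T' and !$depth )       { push @chunks, $value }
    }
    return join ' ', @chunks;
}

# Made-up sample input, for illustration only.
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><p>Hello</p><script>var x = 1;</script>'
           . '<p>world</p></body></html>';
print visible_text($sample), "\n";
```

This should print the text of the two paragraphs and none of the CSS or JavaScript. The "if $depth" guard is a small addition so that a stray closing tag cannot push the counter below zero.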