I've had success skipping script and CSS data in web pages using
HTML::TokeParser, as follows:
use strict;
use warnings;
use HTML::TokeParser;

# this assumes an html file in @ARGV or on STDIN:
my $src;
{   # read the entire HTML input stream as one contiguous string:
    local $/ = undef;
    $src = <>;
}
my $htm = HTML::TokeParser->new( \$src );
my $inscript = 0;   # nesting depth inside ignored elements
my $ignore = join '|', qw/script style cssheader/;
while ( my $tkn = $htm->get_token )
{
    if ( $$tkn[0] eq 'S' and $$tkn[1] =~ /^(?:$ignore)$/ )
    {
        $inscript++;  # skip anything having to do with scripts, styles or css
        next;
    }
    elsif ( $$tkn[0] eq 'E' and $$tkn[1] =~ /^(?:$ignore)$/ )
    {
        $inscript--;
        next;
    }
    elsif ( $$tkn[0] eq 'T' and ! $inscript ) {
        # we have text that is not part of scripting or styling,
        # so do something with this text...
    }
}
This assumes the HTML input is well formed with respect to script, style and cssheader tags. Note that HTML::TokeParser isn't really any more complicated than HTML::TokeParser::Simple; you just have to know the structure of the tokens that it returns, so that you can set up handlers for the different types: start tags are flagged by $$tkn[0] eq 'S', end tags by 'E', text data by 'T', etc., with the tag name or text content stored in $$tkn[1].
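
If it helps to see the token structure in action, here is a minimal self-contained sketch of the same idea applied to an in-memory string. The helper name visible_text and the sample markup are made up for illustration, and it only ignores script and style; the guard on the decrement is an extra assumption so that a stray end tag cannot push the counter below zero.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of the snippet above): return the text of an
# HTML string, skipping anything inside <script> or <style> elements.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser: $!";
    my $skip_depth = 0;    # > 0 while inside an ignored element
    my @chunks;
    while ( my $token = $parser->get_token ) {
        # Element 0 is the token type; element 1 is the tag name
        # (for 'S' and 'E' tokens) or the text itself (for 'T' tokens).
        my ( $type, $value ) = @$token;
        if ( $type eq 'S' and $value =~ /^(?:script|style)$/ ) {
            $skip_depth++;
        }
        elsif ( $type eq 'E' and $value =~ /^(?:script|style)$/ ) {
            $skip_depth-- if $skip_depth > 0;    # tolerate a stray end tag
        }
        elsif ( $type eq 'T' and !$skip_depth ) {
            push @chunks, $value if $value =~ /\S/;    # drop whitespace-only text
        }
    }
    return join ' ', @chunks;
}

# Made-up sample input for illustration.
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>';
print visible_text($sample), "\n";
```

Collecting the text chunks into a list and joining them at the end keeps the token loop itself as simple as the one above; what you actually do with each 'T' token is up to you.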