Re: pulling just text from a url

in reply to pulling just text from a url

I've had success skipping script and CSS data within web pages using HTML::TokeParser, as follows:

# this assumes an HTML file named in @ARGV or supplied on STDIN:

use strict;
use warnings;
use HTML::TokeParser;

my $src;
{  # read the entire HTML input stream as one contiguous string:
    local $/ = undef;
    $src = <>;
}

my $htm = HTML::TokeParser->new( \$src );

my $inscript = 0;
my $ignore = join '|', qw/script style cssheader/;

while ( my $tkn = $htm->get_token )
{
    if ( $$tkn[0] eq 'S' and $$tkn[1] =~ /^(?:$ignore)$/ )
    {
        $inscript++;   # skip anything having to do with scripts, styles or css
        next;
    }
    elsif ( $$tkn[0] eq 'E' and $$tkn[1] =~ /^(?:$ignore)$/ )
    {
        $inscript--;
        next;
    }
    elsif ( $$tkn[0] eq 'T' and ! $inscript ) {
        # we have text that is not part of scripting or styling,
        # so do something with this text...
    }
}

This assumes the HTML input is well formed with respect to script, style and cssheader tags, i.e. every such start tag has a matching end tag; otherwise $inscript never returns to zero and the remaining text is skipped. Note that HTML::TokeParser isn't really any more complicated than HTML::TokeParser::Simple. You just have to know the structure of the tokens that it returns, so that you can set up handlers for the different types: each token is an array ref whose first element is the type (start tags flagged by $$tkn[0] eq 'S', end tags by 'E', text data by 'T', etc.), with the tag name or text content stored in $$tkn[1].
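To make the "do something with this text" step concrete, here is a minimal sketch that wraps the same loop in a sub and collects the text chunks into one string. The name visible_text, the whitespace collapsing, and the guard against a negative counter are my own assumptions for illustration, not part of the snippet above; I've also left cssheader out of the ignore list, so add it back if your pages use it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text of $html that lies outside
# <script> and <style> elements, with runs of whitespace collapsed.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "could not create parser";

    my %ignore = map { $_ => 1 } qw/script style/;
    my $depth  = 0;    # > 0 while inside an ignored element
    my @chunks;

    while ( my $tkn = $parser->get_token ) {
        my ( $type, $value ) = @$tkn;    # type flag, then tag name or text
        if ( $type eq 'S' and $ignore{$value} ) {
            $depth++;
        }
        elsif ( $type eq 'E' and $ignore{$value} ) {
            $depth-- if $depth > 0;      # tolerate a stray end tag
        }
        elsif ( $type eq 'T' and !$depth ) {
            push @chunks, $value;
        }
    }

    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;                  # collapse whitespace runs
    $text =~ s/^ | $//g;                 # trim leading/trailing space
    return $text;
}

print visible_text('<p>Hello <script>var x = 1;</script>world</p>'), "\n";
```

Note that the text tokens come back as they appear in the source, so entities such as &amp; are not decoded here; HTML::Entities can take care of that if you need it.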
