pulling just text from a url

coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I need an accurate way to pull just the readable text from a web page. I was told HTML::TokeParser::Simple would work. The thing is, it's bringing back some CSS and JavaScript code too, including Google Ad source code.

On top of this, there are a lot of li and other tags in the page dump, too. I suppose I can filter these out with regexes, but there's no way I can account for everything that this module misses.

It also misprints some data. For example, the script below prints '0Items in cart', even though there IS a space there on the page.

Is there an accurate way to do this?

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $url = "http://www.sensationalscentsonline.com";

# Fetch the page at $url and tokenize it
my $p = HTML::TokeParser::Simple->new(url => $url);

my $page_source = '';

while ( my $token = $p->get_token ) {
    # Keep only the text tokens (i.e., strip the HTML tags)
    next unless $token->is_text;
    $page_source .= $token->as_is if $token->as_is !~ m/^</;
}

print $page_source;
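
For comparison, here is a minimal sketch of one way the loop above might be adjusted. It rests on a few assumptions rather than anything confirmed for this page: that the CSS and JavaScript show up because the parser hands back the contents of script and style elements as ordinary text tokens, and that '0Items in cart' comes from two adjacent elements with no whitespace between them in the markup. The helper name visible_text and the sample HTML string are made up for illustration. The sketch skips everything between script/style start and end tags, treats every other tag as a word break, and decodes entities with HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;
use HTML::Entities qw(decode_entities);

# visible_text() is a hypothetical helper, not part of any module.
sub visible_text {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );

    my $text = '';
    my $skip = 0;    # true while inside <script> or <style>

    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('script') || $token->is_start_tag('style') ) {
            $skip = 1;
        }
        elsif ( $token->is_end_tag('script') || $token->is_end_tag('style') ) {
            $skip = 0;
        }
        elsif ( $token->is_start_tag || $token->is_end_tag ) {
            # Assumption: any other tag separates words
            $text .= ' ';
        }
        elsif ( $token->is_text && !$skip ) {
            # Turn entities such as &amp; and &nbsp; into characters
            $text .= decode_entities( $token->as_is );
        }
    }

    $text =~ tr/\xA0/ /;      # non-breaking spaces become plain spaces
    $text =~ s/\s+/ /g;       # collapse runs of whitespace
    $text =~ s/^ | $//g;      # trim the ends
    return $text;
}

# Made-up sample markup, standing in for the real page
my $html = '<html><head><style>p { color: red }</style>'
         . '<script>var x = 1;</script></head>'
         . '<body><span>0</span><span>Items in cart</span></body></html>';

print visible_text($html), "\n";
```

To run this against a live page, the constructor call could presumably be switched back to new(url => $url) as in the script above. Treating every tag as a word break is a trade-off: it splits text at inline tags inside a word (for example a single bolded letter), so a list of block-level tags may be a better fit for some pages.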