http://qs1969.pair.com?node_id=537802

coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I need an accurate way to pull just the readable text from a web page. I was told HTML::TokeParser or HTML::TokeParser::Simple would work. The thing is, it's bringing back some CSS and JavaScript too, including Google Ad source code.

On top of this, there are a lot of &nbsp; entities and li tags in the page dump. I suppose I can filter these out with regexes, but there's no way I can account for everything that this module misses.

It also misprints some data. For example, the script below prints '0Items in cart', even though there IS a space there on the page.

Is there an accurate way to do this?

#!/usr/bin/perl
use warnings;
use strict;

my $url = "http://www.sensationalscentsonline.com";
my $page_source;

use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(url => $url);

while ( my $token = $p->get_token ) {
    # This prints all text in an HTML doc (i.e., it strips the HTML)
    next unless $token->is_text;
    $page_source .= $token->as_is if $token->as_is !~ m/^</;
}

print $page_source;