HTML::Parser, get rid of JavaScript

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I use the following code to extract the text from the HTML at a given URL:

#!/usr/bin/perl

use strict;
use warnings;

my $the_file;

use LWP::Simple;
#$the_file = get("http://www.perlmonks.org");
#  or
$the_file = get("http://search.yahoo.com/search?p=hotel&fr=yfp-t-103&toggle=1&cop=mss&ei=UTF-8");

use HTML::Parser;
my $parser = HTML::Parser->new(
    text_h           => [ \&text_handler, "self,dtext" ],
    start_document_h => [ \&init, "self" ],
);

$parser->parse($the_file);

print @{$parser->{_private}->{text}};

sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}

sub text_handler
{
    my ( $self, $text) = @_;

    push @{$self->{_private}->{text}}, $text;
}

It works pretty well, but the output includes JavaScript code at the end. How can I get rid of it?

Re: HTML::Parser, get rid of JavaScript by pc88mxer (Vicar) on Jun 25, 2008 at 00:49 UTC
I'll just outline a solution and leave the details to you. Basically, add a flag which tells `text_handler()` whether or not to append text:

    my $ok_to_add_text;
    ...
    sub text_handler { ... if ($ok_to_add_text) { push ... } }

Then add a handler to detect the tags `<SCRIPT>` and `</SCRIPT>`. Turn off `$ok_to_add_text` when you see the first tag and turn it back on when you see the second one. You can also use this approach to avoid getting the CSS at the beginning (i.e. the text that appears in the STYLE tag).
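For anyone who wants the details filled in, here is a minimal sketch of the flag idea outlined above. It is one possible way to do it, not necessarily what pc88mxer had in mind. The sub name `visible_text` and the sample HTML are made up for illustration, and a counter stands in for the boolean flag so that the same start/end handlers cover both SCRIPT and STYLE. It assumes HTML::Parser's version 3 API, where `tagname` is passed to the handlers in lower case.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text of $html, leaving out anything
# found inside <script> or <style> elements.
sub visible_text {
    my ($html) = @_;
    my @text;
    my $skip = 0;    # greater than 0 while inside <script> or <style>

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tag: stop collecting text when a script/style element opens.
        start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)$/ }, "tagname" ],
        # End tag: resume collecting text when the element closes.
        end_h => [
            sub { $skip-- if $skip > 0 && $_[0] =~ /^(?:script|style)$/ },
            "tagname",
        ],
        # Text: keep it only while we are not inside a skipped element.
        text_h => [ sub { push @text, $_[0] unless $skip }, "dtext" ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any text the parser is still buffering
    return join '', @text;
}

# Made-up sample document with one style block and one script block.
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>';

print visible_text($sample), "\n";
```

In the original script the same three handlers could be passed to the existing `HTML::Parser->new(...)` call, with `$the_file` in place of `$sample`.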
Re: HTML::Parser, get rid of JavaScript by Anonymous Monk on Jun 25, 2008 at 05:47 UTC
Use HTML::StripScripts.
Re: HTML::Parser, get rid of JavaScript by tachyon-II (Chaplain) on Jun 25, 2008 at 14:24 UTC
use HTML::Parser;
use Text::Wrap;

sub html2text {
    my $html = shift;
    my %inside;
    my $text = '';
    my $tag = sub { $inside{$_[0]} += $_[1]; $text .= " " };
    my $txt = sub { $text .= $_[0] unless $inside{script} or $inside{style} };
    HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [$tag, "tagname, '+1'"],
            end   => [$tag, "tagname, '-1'"],
            text  => [$txt, "dtext"]
        ],
        marked_sections => 1,
    )->parse($html);
    #$text =~ tr/\11\12\40-\176//cd; # remove wide non ascii chars
    $text = Text::Wrap::fill('', '', $text);
    $text =~ s/^\s+//;
    return $text;
}

Update: Commented out the arbitrary removal of non-ASCII chars, as pointed out by moritz.
Re^2: HTML::Parser, get rid of JavaScript by moritz (Cardinal) on Jun 25, 2008 at 14:56 UTC
`# remove wide non ascii chars`

Why would you want to do that? Usually characters are in a string because they carry information. Removing them by a criterion as blind as codepoint ranges almost surely implies data loss. There are many pages on the internet where next to nothing remains if you remove all non-ASCII chars.
Re^3: HTML::Parser, get rid of JavaScript by tachyon-II (Chaplain) on Jun 25, 2008 at 16:26 UTC
"Why would you want to do that?"

Fair point. In the application I cut and pasted it from, I did want only ASCII text. I've commented it out.

Update