vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I use the following code to get the text from a given URL (the content is actually HTML):
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $the_file;
    use LWP::Simple;
    #$the_file = get("http://www.perlmonks.org");
    # or
    $the_file = get("http://search.yahoo.com/search?p=hotel&fr=yfp-t-103&toggle=1&cop=mss&ei=UTF-8");

    use HTML::Parser;
    my $parser = HTML::Parser->new(
        text_h           => [ \&text_handler, "self,dtext" ],
        start_document_h => [ \&init, "self" ],
    );
    $parser->parse($the_file);
    print @{$parser->{_private}->{text}};

    sub init {
        my ( $self ) = @_;
        $self->{_private}->{text} = [];
    }

    sub text_handler {
        my ( $self, $text ) = @_;
        push @{$self->{_private}->{text}}, $text;
    }
It works pretty well, but it returns JavaScript code at the end. How can I get rid of it?

Replies are listed 'Best First'.
Re: HTML::Parser, get rid of JavaScript
by pc88mxer (Vicar) on Jun 25, 2008 at 00:49 UTC
    I'll just outline a solution and leave the details to you. Basically, add a flag which tells text_handler() whether or not to append text:
        my $ok_to_add_text;
        ...
        sub text_handler {
            ...
            if ($ok_to_add_text) { push ... }
        }
    Then add a handler to detect the tags <SCRIPT> and </SCRIPT>. Turn off $ok_to_add_text when you see the first tag and turn it back on when you see the second one. You can also use this approach to avoid getting the CSS at the beginning (i.e. the text that appears in the STYLE tag.)
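The outline above could be filled in roughly as follows. This is only a minimal sketch, not pc88mxer's actual code: the `visible_text` function name, the `%skip` lookup and the sample HTML string are assumptions made for illustration, and closures stand in for the named handler subs. It relies on HTML::Parser reporting `tagname` in lower case by default.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the text of $html, minus anything found
# inside <script> or <style> elements.
sub visible_text {
    my ($html) = @_;
    my @text;
    my $ok_to_add_text = 1;                       # the flag from the outline
    my %skip = map { $_ => 1 } qw(script style);  # tags whose text is dropped

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Turn the flag off at an opening tag of interest ...
        start_h => [ sub { $ok_to_add_text = 0 if $skip{ $_[0] } }, "tagname" ],
        # ... and back on at the matching closing tag.
        end_h   => [ sub { $ok_to_add_text = 1 if $skip{ $_[0] } }, "tagname" ],
        # Only collect text while the flag is on.
        text_h  => [ sub { push @text, $_[0] if $ok_to_add_text }, "dtext" ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any text still buffered by the parser
    return join '', @text;
}

# Made-up sample input standing in for the fetched page.
my $sample = '<p>Hello</p><script>var x = 1;</script>'
           . '<style>p { color: red }</style><p>World</p>';
print visible_text($sample), "\n";
```

A single on/off flag is enough here because script and style elements do not nest; a per-tag counter is another option if more element types need to be skipped.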
Re: HTML::Parser, get rid of JavaScript
by Anonymous Monk on Jun 25, 2008 at 05:47 UTC
Re: HTML::Parser, get rid of JavaScript
by tachyon-II (Chaplain) on Jun 25, 2008 at 14:24 UTC
        use HTML::Parser;
        use Text::Wrap;

        sub html2text {
            my $html = shift;
            my %inside;
            my $text = '';
            my $tag = sub { $inside{$_[0]} += $_[1]; $text .= " " };
            my $txt = sub { $text .= $_[0] unless $inside{script} or $inside{style} };
            HTML::Parser->new(
                api_version => 3,
                handlers    => [
                    start => [$tag, "tagname, '+1'"],
                    end   => [$tag, "tagname, '-1'"],
                    text  => [$txt, "dtext"],
                ],
                marked_sections => 1,
            )->parse($html);
            #$text =~ tr/\11\12\40-\176//cd; # remove wide non ascii chars
            $text = Text::Wrap::fill('', '', $text);
            $text =~ s/^\s+//;
            return $text;
        }

    Update

    Commented out the arbitrary removal of non-ASCII chars, as pointed out by moritz.

      # remove wide non ascii chars

      Why would you want to do that?

      Usually characters are in a string because they carry information - removing them by such a blind criterion as codepoint ranges almost surely implies data loss.

      There are many pages on the internet where next to nothing remains if you remove all non-ASCII chars.
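To make the data-loss point concrete, here is a small sketch that applies the same `tr` expression to two made-up sample strings. The `strip_non_ascii` wrapper and the sample strings are assumptions for illustration only; the strings are written with `\x{...}` escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the tr/// from html2text: delete everything
# except tab, newline and printable ASCII (octal 40 to 176).
sub strip_non_ascii {
    my ($s) = @_;
    $s =~ tr/\11\12\40-\176//cd;
    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';

my $german   = "Z\x{fc}rich caf\x{e9}";   # "Zürich café"
my $japanese = "\x{6771}\x{4eac}";        # "Tokyo" written in kanji

# The German string loses its accented letters; the Japanese one is emptied.
printf "[%s] -> [%s]\n", $_, strip_non_ascii($_) for $german, $japanese;
```

The accented letters are dropped rather than transliterated, and text in a non-Latin script disappears completely.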

        Why would you want to do that?

        Fair point. In the application I cut and pasted it from, I did want only ASCII text. I've commented it out.