in reply to HTML::Parser, get rid of JavaScript

use HTML::Parser;
use Text::Wrap;

sub html2text {
    my $html = shift;
    my %inside;
    my $text = '';
    my $tag = sub { $inside{$_[0]} += $_[1]; $text .= " " };
    my $txt = sub { $text .= $_[0] unless $inside{script} or $inside{style} };
    HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [$tag, "tagname, '+1'"],
            end   => [$tag, "tagname, '-1'"],
            text  => [$txt, "dtext"],
        ],
        marked_sections => 1,
    )->parse($html);
    #$text =~ tr/\11\12\40-\176//cd; # remove wide non ascii chars
    $text = Text::Wrap::fill('', '', $text);
    $text =~ s/^\s+//;
    return $text;
}

Update

Commented out the arbitrary removal of non-ASCII chars, as pointed out by moritz.

Re^2: HTML::Parser, get rid of JavaScript
by moritz (Cardinal) on Jun 25, 2008 at 14:56 UTC
    # remove wide non ascii chars

    Why would you want to do that?

    Usually characters are in a string because they carry information - removing them by such a blind criterion as codepoint ranges almost surely implies data loss.

    There are many pages on the internet where next to nothing remains if you remove all non-ASCII chars.
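To make the data loss concrete, here is a minimal sketch that applies the same tr/// from the commented-out line to a few strings. The sample strings and the helper name ascii_only are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal non-ASCII characters

# Hypothetical helper (not in the original post): applies the same
# transliteration as the commented-out line to a copy of the string.
sub ascii_only {
    my ($s) = @_;
    # c = complement the search list, d = delete matched characters:
    # everything except tab, newline and printable ASCII is dropped.
    (my $copy = $s) =~ tr/\11\12\40-\176//cd;
    return $copy;
}

binmode STDOUT, ':encoding(UTF-8)';

# Sample strings chosen only for illustration.
for my $s ("Grüße aus Köln", "naïve café", "日本語のページ") {
    print "'$s' -> '", ascii_only($s), "'\n";
}
```

The German and French samples come back as "Gre aus Kln" and "nave caf", and the Japanese one comes back empty, which is the "next to nothing remains" case.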

      Why would you want to do that?

      Fair point. In the application I cut and pasted it from, I did want only ASCII text. I've commented it out.