Re: HTML::Parser, get rid of JavaScript

use HTML::Parser;
use Text::Wrap;

sub html2text {
    my $html = shift;
    my %inside;
    my $text = '';
    my $tag = sub { $inside{$_[0]} += $_[1]; $text .= " " };
    my $txt = sub { $text .= $_[0] unless $inside{script} or $inside{style} };
    HTML::Parser->new(  api_version => 3,
                        handlers => [ start => [$tag, "tagname, '+1'"],
                                      end   => [$tag, "tagname, '-1'"],
                                      text  => [$txt, "dtext"] ],
                        marked_sections => 1,
    )->parse($html)->eof;   # eof flushes any text still buffered by the parser
    #$text =~ tr/\11\12\40-\176//cd; # remove wide non ascii chars
    $text = Text::Wrap::fill('', '', $text);
    $text =~ s/^\s+//;
    return $text;
}
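
For comparison, here is a minimal sketch of a different approach from the one above: instead of counting open tags by hand, it uses HTML::Parser's `ignore_elements` method to suppress everything inside `script` and `style`. The function name `strip_scripts_text` and the sample HTML are hypothetical, and unlike `html2text` it does not insert spaces between tags or wrap the result.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the decoded text of $html,
# skipping anything inside <script> or <style> elements.
sub strip_scripts_text {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # collect decoded text chunks as they are reported
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );

    # suppress start, end and all events in between for these elements
    $p->ignore_elements(qw(script style));

    $p->parse($html);
    $p->eof;    # flush any text still buffered by the parser

    return $text;
}

# Sample input, made up for illustration
my $sample = '<p>Hello</p><script>alert("x")</script><p>world</p>';
print strip_scripts_text($sample), "\n";
```

The trade-off is less control: the hand-rolled `%inside` counter makes it easy to add a space at every tag boundary, which this sketch leaves out.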

Update

Commented out the arbitrary removal of non-ASCII chars, as pointed out by moritz.


Re^2: HTML::Parser, get rid of JavaScript by moritz (Cardinal) on Jun 25, 2008 at 14:56 UTC
`# remove wide non ascii chars` Why would you want to do that? Usually characters are in a string because they carry information; removing them by a criterion as blind as codepoint ranges almost surely implies data loss. There are many pages on the internet where next to nothing remains if you remove all non-ASCII chars.
Re^3: HTML::Parser, get rid of JavaScript by tachyon-II (Chaplain) on Jun 25, 2008 at 16:26 UTC
"Why would you want to do that?" Fair point. In the application I cut and pasted it from, I did want only ASCII text. I've commented it out.
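
To make the data-loss point concrete, here is a small sketch that applies the same `tr` expression from the commented-out line to a made-up German sample string. The sample text and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8
binmode STDOUT, ':encoding(UTF-8)';

my $original = "Grüße aus München";     # hypothetical sample text

# Same expression as in html2text: delete every character that is NOT
# tab, newline, or printable ASCII (space through tilde).
(my $ascii_only = $original) =~ tr/\11\12\40-\176//cd;

print "before: $original\n";
print "after:  $ascii_only\n";          # "Gre aus Mnchen"
```

Three characters vanish without a trace, and for text in a non-Latin script the result would be little more than spaces and punctuation.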