in reply to HTML::Parser, get rid of JavaScript
use HTML::Parser;
use Text::Wrap;

sub html2text {
    my $html = shift;
    my %inside;
    my $text = '';
    my $tag = sub { $inside{$_[0]} += $_[1]; $text .= " " };
    my $txt = sub { $text .= $_[0] unless $inside{script} or $inside{style} };
    HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [$tag, "tagname, '+1'"],
            end   => [$tag, "tagname, '-1'"],
            text  => [$txt, "dtext"]
        ],
        marked_sections => 1,
    )->parse($html);
    #$text =~ tr/\11\12\40-\176//cd; # remove wide non ascii chars
    $text = Text::Wrap::fill('', '', $text);
    $text =~ s/^\s+//;
    return $text;
}
Commented out the arbitrary removal of non-ASCII chars, as pointed out by moritz.
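To see why that line was worth disabling, here is a minimal sketch of what the transliteration does on its own. The sample string is made up for illustration; only the tr/// expression comes from the code above. With the c and d flags it deletes every character that is not a tab, a newline, or in the printable range from space to tilde, so accented letters and other non-ASCII punctuation are silently dropped rather than converted.

```perl
use strict;
use warnings;

# Hypothetical sample text containing e-acute, i-diaeresis and a long dash.
my $text = "caf\x{e9} na\x{ef}ve \x{2014} ok\n";

# Same expression as the commented-out line: keep only tab (\11),
# newline (\12) and space through tilde (\40-\176); delete the rest.
(my $stripped = $text) =~ tr/\11\12\40-\176//cd;

print $stripped;    # prints "caf nave  ok" followed by a newline
```

If the output really has to be plain ASCII, transliterating the characters instead of deleting them would lose less information, but that is a separate design choice from what html2text does.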
Re^2: HTML::Parser, get rid of JavaScript
by moritz (Cardinal) on Jun 25, 2008 at 14:56 UTC
by tachyon-II (Chaplain) on Jun 25, 2008 at 16:26 UTC