(The ridiculous number of lines and named variables is left over from a step-by-step debugging session in which I was trying to solve a UTF-8 problem. Essentially: make sure you have the latest HTML::Parser installed if you anticipate UTF-8 characters, and you may need to have them "marked" as such by running the input through the decode() function.)

    use HTML::TreeBuilder;
    use HTML::FormatText;
    use Encode;

    sub HTML_to_text {
        my $content = shift;
        my $html = HTML::TreeBuilder->new;
        # Wrap bare fragments so the parser sees a body element.
        $content = '<body>' . $content . '</body>'
            unless $content =~ m/<body[^>]*>/i;
        # decode() is necessary, otherwise UTF-8 chars get hamburgered.
        $html->parse( decode("utf8", $content) );
        $html->eof;       # signal end of input so the tree is finalized
        my $formatter = HTML::FormatText->new;
        # _trim is a selective trimmer (not shown) that preserves some kinds
        # of whitespace; delete the call if you don't need it.
        my $out = _trim($formatter->format($html));
        $html->delete;    # release the tree once the text is extracted
        return $out;
    }
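To make the "marking" remark concrete, here is a minimal sketch using only the core Encode module. The sample string and variable names are made up for illustration and are not part of the routine above. It shows that the same data is five bytes before decode() and four characters afterwards, which is roughly the difference the parser cares about.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the word "café" as raw UTF-8 bytes,
# the way it might arrive from a file or a socket.
my $bytes = "caf\xC3\xA9";

# decode() converts the byte string into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "before decode: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "after decode:  length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

If the input has already been decoded somewhere upstream, decoding it a second time is likely to mangle it or fail, so it is worth checking where your content comes from before adding the call.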
If you only need to remove *some* tags, try HTML::TagFilter.
In reply to "Re: Using HTML::Parser for simple tag removal" by rlucas, in thread "Using HTML::Parser for simple tag removal" by bradcathey