in reply to Using HTML::Parser for simple tag removal
(The ridiculous number of lines and named variables is left over from step-by-step debugging while I was chasing a UTF-8 problem. In short: make sure you have the latest HTML::Parser installed if you expect UTF-8 characters, and you may need to have the input "marked" as such by passing it through the decode() function.)

    use HTML::TreeBuilder;
    use HTML::FormatText;
    use Encode;

    sub HTML_to_text {
        my $content = shift;
        my $html = HTML::TreeBuilder->new;

        # wrap bare fragments so the parser always sees a <body>
        $content = '<body>' . $content . '</body>'
            unless $content =~ m/<body[^>]*>/i;

        # decode() is necessary, otherwise UTF-8 chars get hamburgered
        $html->parse( decode("utf8", $content) );
        $html->eof;    # tell the parser the document is complete

        my $formatter = HTML::FormatText->new;

        # _trim is a selective trimmer (not shown here) that preserves some
        # kinds of whitespace; delete the call if you don't need it
        my $out = _trim($formatter->format($html));
        return $out;
    }
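The _trim helper is not shown in the post, and the effect of decode() is easy to miss, so here is a minimal sketch of both using only core Perl. The body of _trim below is a hypothetical stand-in, not the original author's routine: it assumes "selective" means dropping leading and trailing blank lines and trailing spaces while keeping indentation and paragraph breaks.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the post's _trim(): removes trailing spaces on
# each line plus leading and trailing blank lines, but keeps indentation
# and the blank lines between paragraphs.
sub _trim {
    my ($text) = @_;
    $text =~ s/[ \t]+$//mg;    # trailing spaces/tabs on every line
    $text =~ s/\A\n+//;        # leading blank lines
    $text =~ s/\s+\z//;        # trailing whitespace, including newlines
    return $text;
}

# decode() turns raw UTF-8 bytes into a Perl character string.
my $bytes = "caf\xC3\xA9";             # 5 bytes: "cafe" with e-acute, UTF-8 encoded
my $chars = decode("utf8", $bytes);    # 4 characters

printf "%d bytes, %d characters\n", length($bytes), length($chars);
```

The length difference is the point: without decode(), the two bytes of the accented character are treated as two separate characters, which is the kind of mangling the post describes.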
If you only need to remove *some* tags, try HTML::TagFilter.
HTML::FromText
by marnanel (Beadle) on May 26, 2005 at 20:57 UTC