Ah, I think I see what's going on. All this does is combine the text while removing the tags, and nothing else. That means that if your HTML looks like
foo<br>bar
then this will simply remove the "<br>", thereby combining the two pieces of text into one single word, "foobar".
You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" would better be replaced with newlines.
I've tried the following extension to your code, and it appears to work rather well.
{
    package Example;
    use strict;
    use warnings;
    use HTML::Parser;
    our @ISA = qw(HTML::Parser);

    # plain text substitution for those tags that need it:
    my %tagtext = ( p => "\n\n", br => "\n", img => " " );

    # called for each run of plain text: append it to the result
    sub text {
        my($self, $text) = @_;
        $self->{TEXT} .= $text;
    }

    # called for each start tag: append its substitute, if it has one
    sub start {
        my($self, $tag, $attr, $attrseq, $origtext) = @_;
        defined(my $text = $tagtext{$tag}) or return;
        $self->{TEXT} .= $text;
    }
}
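use LWP::Simple;
my $content = get("http://www.yahoo.com");
defined $content or die "Couldn't fetch the page\n";
my $parser = Example->new();
$parser->parse($content);
$parser->eof;    # flush any text the parser is still buffering
print $parser->{TEXT};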
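
For a quick check without hitting the network, here is a minimal sketch of the same idea using the handler-style interface of HTML::Parser (api_version 3) rather than a subclass. The sample HTML string and the variable names ($out, %tagtext) are my own choices for illustration, not something from your code.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumed mapping, same as in the subclass version above:
# tags that should leave whitespace behind when they are stripped.
my %tagtext = ( p => "\n\n", br => "\n", img => " " );

my $out = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # start tags: append the substitute text, if the tag has one
    start_h => [ sub {
        my ($tag) = @_;
        $out .= $tagtext{$tag} if exists $tagtext{$tag};
    }, 'tagname' ],
    # plain text: append it with entities decoded
    text_h => [ sub { $out .= shift }, 'dtext' ],
);

# Hypothetical sample input, chosen to exercise <br>, <p> and <b>.
$parser->parse('foo<br>bar<p>baz <b>bold</b></p>');
$parser->eof;

print $out;
```

With that input, "foo" and "bar" should end up on separate lines instead of running together as "foobar", while the "<b>" tag just disappears.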