then, for input such as "foo<br>bar", this will simply remove the "<br>", thereby combining the two pieces of text into one single word, "foobar".
You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" would better be replaced with newlines.
I've tried the following extension to your code, and it appears to work rather well.
    {
        package Example;
        use HTML::Parser;

        # plain text substitution for those tags that need it:
        my %tagtext = ( p => "\n\n", br => "\n", img => " ");

        @Example::ISA = qw(HTML::Parser);

        # collect every piece of text the parser reports
        sub text {
            my($self, $text) = @_;
            $self->{TEXT} .= $text;
        }

        # for a start tag, append its substitute text, if it has one
        sub start {
            my($self, $tag, $attr, $attrseq, $origtext) = @_;
            defined(my $text = $tagtext{$tag}) or return;
            $self->{TEXT} .= $text;
        }
    }

    use LWP::Simple;
    my $content = get("http://www.yahoo.com");
    defined $content or die "Couldn't fetch the page";

    my $parser = Example->new();
    $parser->parse($content);
    $parser->eof;    # flush any text the parser is still holding back
    print $parser->{TEXT};
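To check the idea without fetching a page, here is a minimal sketch of the same tag-to-whitespace mapping, this time using the handler-style (version 3) interface of HTML::Parser instead of a subclass. The helper name html_to_text is my own invention, not part of the module, and I'm assuming that decoded text ("dtext") is what you want in the output.

```perl
use strict;
use warnings;
use HTML::Parser;

# Tags that should leave whitespace behind; all other tags are simply dropped.
my %tagtext = ( p => "\n\n", br => "\n", img => " " );

# html_to_text is a hypothetical helper name, not part of HTML::Parser.
sub html_to_text {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # "tagname" hands the handler the lower-cased tag name
        start_h => [ sub { $out .= $tagtext{$_[0]} if exists $tagtext{$_[0]} }, 'tagname' ],
        # "dtext" hands the handler the text with entities decoded
        text_h  => [ sub { $out .= $_[0] }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any text still buffered
    return $out;
}

# "foo" and "bar" now end up on separate lines instead of merging into "foobar"
print html_to_text('foo<br>bar'), "\n";
```

With the mapping in place, the "<br>" becomes a newline, so the two words stay apart. Extending %tagtext with more tags (say, "li" or "tr") should work the same way.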
In reply to Re: HTML::Parser question
by bart
in thread HTML::Parser question
by mkurtis