in reply to HTML::Parser question
If the HTML contains something like foo<br>bar, then this will simply remove the "<br>", thereby combining the two pieces of text into one single word, "foobar".
You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" would better be replaced with newlines.
I've tried the following extension to your code, and it appears to work rather well.
    {
        package Example;
        use HTML::Parser;
        # plain text substitution for those tags that need it:
        my %tagtext = ( p => "\n\n", br => "\n", img => " ");
        @Example::ISA = qw(HTML::Parser);

        # append every chunk of text to the accumulated output
        sub text {
            my($self, $text) = @_;
            $self->{TEXT} .= $text;
        }

        # for a start tag, append its plain text substitute, if it has one
        sub start {
            my($self, $tag, $attr, $attrseq, $origtext) = @_;
            defined(my $text = $tagtext{$tag}) or return;
            $self->{TEXT} .= $text;
        }
    }

    use LWP::Simple;
    my $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    $parser->eof;    # signal end of document so buffered text is flushed
    print $parser->{TEXT};
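To try the idea without fetching a page, here is a minimal sketch that runs the same tag-to-whitespace table over a literal string. It is not the code above: it uses the version 3 handler interface of HTML::Parser instead of a subclass, it asks for decoded text ('dtext'), and the helper name html_to_text is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Plain text substitutes for tags that carry layout meaning
# (same table as in the post above).
my %tagtext = ( p => "\n\n", br => "\n", img => " " );

# Hypothetical helper: return the text of an HTML string, replacing
# significant tags with whitespace and dropping all other tags.
sub html_to_text {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with HTML entities decoded
        text_h  => [ sub { $out .= shift }, 'dtext' ],
        # 'tagname' passes the lowercased tag name
        start_h => [ sub {
            my $tag = shift;
            $out .= $tagtext{$tag} if exists $tagtext{$tag};
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any text still buffered
    return $out;
}

print html_to_text('foo<br>bar<p>baz <i>qux</i>'), "\n";
```

With this input, "foo" and "bar" end up on separate lines instead of running together as "foobar", while the "<i>" tag disappears without leaving any whitespace behind.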
Re: HTML::Parser question
by mkurtis (Scribe) on Mar 07, 2004 at 23:13 UTC
by Juerd (Abbot) on Mar 07, 2004 at 23:18 UTC
by mkurtis (Scribe) on Mar 07, 2004 at 23:44 UTC