Re: HTML::Parser question

It's hard to know exactly what's happening without seeing your code. I see two possibilities:

First, check the unbroken_text setting.
Second, perhaps Yahoo is putting in <br> instead of newlines (yuck!). If you're ignoring HTML tags, you wouldn't see them. Hmmm. I just checked: they're not. And there seems to be whitespace between their <p> tags, which I think you should be getting in your text handler routine.
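To see what the unbroken_text setting changes, here is a minimal sketch. It assumes HTML::Parser 3.x is installed and uses the version 3 handler API rather than the subclassing style; the helper name `text_chunks` and the sample input are made up for illustration. It feeds the parser a document in pieces and records each string the text handler receives.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the strings passed to the text handler.
# $unbroken switches the parser's unbroken_text option on or off.
sub text_chunks {
    my ($unbroken, @pieces) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => $unbroken,
        text_h        => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $parser->parse($_) for @pieces;   # feed the document piece by piece
    $parser->eof;                     # flush anything still buffered
    return @chunks;
}

# Sample input, split in the middle of a word on purpose.
my @input = ("Hello wor", "ld, how are ", "you?<p>Next paragraph");
for my $unbroken (0, 1) {
    my @chunks = text_chunks($unbroken, @input);
    printf "unbroken_text=%d: %d chunk(s): %s\n",
        $unbroken, scalar @chunks, join ' | ', map { "[$_]" } @chunks;
}
```

With the option off, a run of text may reach the handler in several pieces; with it on, each run between tags should arrive as one string. Either way, concatenating the pieces gives the same text, so this setting alone would not explain words being glued together.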

What does your text capture subroutine look like? Are you handling any HTML tags, or just the text?

(This answer moved from the comment on the node about to be deleted because it was a duplicate)


HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 19:17 UTC
Here it is:

    #!/usr/bin/perl -w
    package Example;
    use LWP::Simple;
    use HTML::Parser;

    @Example::ISA = qw(HTML::Parser);

    $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    print $parser->{TEXT};

    sub text {
        my ($self, $text) = @_;
        $self->{TEXT} .= $text;
    }

Thanks
Re: HTML::Parser question by bart (Canon) on Mar 07, 2004 at 21:50 UTC
Ah, I think I see what's going on. All that this does is combine the text, while removing the tags. Nothing but removing the tags. That means that if your HTML looks like `foo<br>bar` then this will simply remove the "`<br>`", thereby combining the two pieces of text into one single word "foobar".

You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "`<i>`" and "`<b>`" tags can just go, but "`<p>`" and "`<br>`" would better be replaced with newlines.

I've tried the following extension to your code, and it appears to work rather well.

    {
        package Example;
        use HTML::Parser;

        # plain text substitution for those tags that need it:
        my %tagtext = ( p => "\n\n", br => "\n", img => " ");

        @Example::ISA = qw(HTML::Parser);

        sub text {
            my($self, $text) = @_;
            $self->{TEXT} .= $text;
        }

        sub start {
            my($self, $tag, $attr, $attrseq, $origtext) = @_;
            defined(my $text = $tagtext{$tag}) or return;
            $self->{TEXT} .= $text;
        }
    }

    use LWP::Simple;
    $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    print $parser->{TEXT};
Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 23:13 UTC
I tried your code as well, bart, but it still combines some words, and now has & and the `&nbsp` between them. I don't know if this is because the non-breaking spaces are not tags, so they aren't removed, but how would I remove them and the &? Yahoo's clock code also shows up in the parser output; I'll see if it's within tags as well. Thanks for your help.
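One way to approach the entity and clock-code problems, as a rough sketch rather than a definitive fix: it assumes HTML::Parser 3.x and its version 3 handler API, reuses the tag table idea from bart's reply, and the function name `html_to_text` and the sample HTML are made up for illustration. The `dtext` argspec asks the parser for text with entities already decoded, and `ignore_elements` suppresses everything inside the listed elements, which is where a JavaScript clock would presumably live.

```perl
use strict;
use warnings;
use HTML::Parser;

# Whitespace to emit for tags that separate text (same idea as above).
my %tagtext = ( p => "\n\n", br => "\n", img => " " );

# Hypothetical helper: turn an HTML string into plain text.
sub html_to_text {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' delivers text with entities such as &amp; already decoded
        text_h => [ sub {
            my ($text) = @_;
            $text =~ tr/\xA0/ /;    # decoded &nbsp; becomes a normal space
            $out .= $text;
        }, 'dtext' ],
        # emit replacement whitespace for significant start tags only
        start_h => [ sub {
            my ($tag) = @_;
            $out .= $tagtext{$tag} if exists $tagtext{$tag};
        }, 'tagname' ],
    );
    # drop the contents of <script> and <style> entirely
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print html_to_text(
    '<p>Tom&nbsp;&amp;&nbsp;Jerry<br>cartoons<script>var clock = 1;</script>'
), "\n";
```

If you would rather keep the subclass approach, calling `decode_entities` from HTML::Entities on the text inside your `text` method should have a similar effect on the entities.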
Re: Re: HTML::Parser question by Juerd (Abbot) on Mar 07, 2004 at 23:18 UTC
Re: Re: Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 23:44 UTC
Re: HTML::Parser question by Prior Nacre V (Hermit) on Mar 07, 2004 at 20:17 UTC
`use strict;`

PN5
Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 20:38 UTC
Thanks PN5, however that doesn't make the script run any differently; it just makes me change $content to my $content. Thanks