HTML::Parser question

mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML::Parser question by matija (Priest) on Mar 07, 2004 at 19:03 UTC
It's hard to know exactly what's happening without seeing your code. I see two possibilities. First, check the `unbroken_text` setting. Second, perhaps Yahoo is putting in `<br>` instead of newlines (yuck!); if you're ignoring HTML tags, you wouldn't see them. Hmmm. I just checked: they're not. And there seems to be whitespace between their `<p>` tags, which I think you should be getting in your text handler routine. What does your text capture subroutine look like? Are you handling any HTML tags, or just the text? (This answer was moved from a comment on a duplicate node that is about to be deleted.)
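For readers unfamiliar with `unbroken_text`, here is a minimal sketch of what the setting changes. It uses the version 3 handler style of HTML::Parser rather than a subclass, and `text_chunks` is a hypothetical helper name chosen for illustration. The HTML is fed in two pieces to imitate a document arriving in chunks.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: feed HTML to the parser in several pieces and
# return the list of strings that the text handler received.
sub text_chunks {
    my ($unbroken, @pieces) = @_;
    my @seen;

    my $parser = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { push @seen, $_[0] }, 'dtext' ],
    );

    # When enabled, a run of text is reported in one piece instead of
    # as soon as the parser has seen part of it.
    $parser->unbroken_text($unbroken);

    $parser->parse($_) for @pieces;
    $parser->eof;
    return @seen;
}

my @pieces   = ('<p>foo ba', 'r baz</p>');
my @broken   = text_chunks(0, @pieces);
my @unbroken = text_chunks(1, @pieces);

print scalar(@broken),   " text callback(s) by default\n";
print scalar(@unbroken), " text callback(s) with unbroken_text\n";
```

Either way the concatenated text is the same, so this setting alone should not glue words together; it only affects how many times the text handler is called.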
HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 19:17 UTC
Here it is:

    #!/usr/bin/perl -w
    package Example;
    use LWP::Simple;
    use HTML::Parser;
    @Example::ISA=qw(HTML::Parser);

    $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    print $parser->{TEXT};

    sub text {
        my ($self,$text)=@_;
        $self->{TEXT}.=$text;
    }

Thanks
Re: HTML::Parser question by bart (Canon) on Mar 07, 2004 at 21:50 UTC
Ah, I think I see what's going on. All that this does is combine the text while removing the tags. Nothing but removing the tags. That means that if your HTML looks like `foo<br>bar` then this will simply remove the "`<br>`", thereby combining the two pieces of text into one single word, "foobar".

You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "`<i>`" and "`<b>`" tags can just go, but "`<p>`" and "`<br>`" would better be replaced with newlines.

I've tried the following extension to your code, and it appears to work rather well:

    {
        package Example;
        use HTML::Parser;

        # plain text substitution for those tags that need it:
        my %tagtext = ( p => "\n\n", br => "\n", img => " ");

        @Example::ISA=qw(HTML::Parser);

        sub text {
            my($self, $text) = @_;
            $self->{TEXT}.=$text;
        }

        sub start {
            my($self, $tag, $attr, $attrseq, $origtext) = @_;
            defined(my $text = $tagtext{$tag}) or return;
            $self->{TEXT} .= $text;
        }
    }

    use LWP::Simple;
    $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    print $parser->{TEXT};
Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 23:13 UTC
Re: Re: HTML::Parser question by Juerd (Abbot) on Mar 07, 2004 at 23:18 UTC
Re: HTML::Parser question by Prior Nacre V (Hermit) on Mar 07, 2004 at 20:17 UTC
`use strict;`

PN5
Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 20:38 UTC
Re: HTML::Parser question by Juerd (Abbot) on Mar 07, 2004 at 20:18 UTC
"does anyone know how to make this not combine the words?"

Are you sure it is combining the words? I think your code is doing that. If your sub gets called multiple times, that is because there were tags in between. You do nothing with those tags, but it is very likely that they were meant to render as some sort of white space.

For formatting HTML as plain text, have a look at HTML::FormatText, or consider using `w3m -dump`, `links -dump` or `lynx -dump`.

A quick and ugly fix for your problem would probably be having start and end handlers that add a single space to the string and a substitution on eof to remove duplicate whitespace.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
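The quick and ugly fix described above could be sketched roughly as follows. This is only one possible reading of it: it uses HTML::Parser's version 3 handler style instead of the subclass from the posted code, `html_to_spaced_text` is a hypothetical name, and the sample HTML is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: drop all tags, but let every start and end tag
# act as a word break, then collapse the resulting whitespace.
sub html_to_spaced_text {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        text_h  => [ sub { $out .= $_[0] }, 'dtext'   ],  # decoded text
        start_h => [ sub { $out .= ' '   }, 'tagname' ],  # tag name unused
        end_h   => [ sub { $out .= ' '   }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;

    # Clean up after eof: squeeze whitespace runs and trim both ends.
    $out =~ s/\s+/ /g;
    $out =~ s/^ //;
    $out =~ s/ $//;
    return $out;
}

print html_to_spaced_text('foo<br>bar<p>baz</p>'), "\n";
```

The trade-off is that inline markup inside a word (for example `f<b>o</b>o`) also gets split, which is why replacing only selected tags, or using a real formatter, tends to give better results.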
Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 20:54 UTC
Thanks Juerd. That quick and ugly fix you were talking about, would that be putting each word on a separate line? I looked at the HTML::FormatText module, but I think that if the parser just stuck every word on a new line it would work, and all would be well. Do you by chance know how to do this? Thanks
Re: Re: HTML::Parser question by Juerd (Abbot) on Mar 07, 2004 at 21:36 UTC
"thanks juerd, that quick and ugly fix that you were talking about, would that be putting each word on a separate line? i looked at that HTML::FormatText module, but i think that if parser just stuck every word on a new line it would work, and all would be well."

That "fix" would do whatever you program it to do. It is not the parser's job to modify anything. It parses and does that well.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
Re: Re: HTML::Parser question by graff (Chancellor) on Mar 08, 2004 at 03:03 UTC
Looking at the code you posted in an earlier reply, you could change this line (in `sub text {...}`): `$self->{TEXT}.=$text;` to read as follows: `$self->{TEXT}.="$text\n";`

I tried your code with this mod, and the result might still not be exactly what you wanted: I saw "nbsp", HTML comments, and other "funny character" entities (©, •, etc.). I think you'll find a way to handle these with HTML::Entities. Also, depending on how far you want to go with filtering the Yahoo page content to get rid of irrelevant stuff (like the comments, the scripting, the forms, etc.), you might get good mileage out of HTML::TokeParser or its ::Simple variant (same functionality, different API).
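As a small illustration of the HTML::Entities suggestion: the text handler in the posted code receives the raw text, so entities such as `&nbsp;` or `&copy;` arrive undecoded. A minimal sketch, with a made-up sample string, of decoding them afterwards:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Made-up sample of raw text as a text handler might collect it.
my $raw = 'Copyright &copy; 2004 &bull; Fish &amp; Chips';

# In non-void context decode_entities returns a decoded copy
# and leaves $raw untouched.
my $decoded = decode_entities($raw);

# The decoded string can contain non-ASCII characters, so set an
# output encoding before printing.
binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```

In the posted code this would probably mean calling `decode_entities` on `$text` inside `sub text` before appending it to `$self->{TEXT}`.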
Re: Re: Re: HTML::Parser question by mkurtis (Scribe) on Mar 09, 2004 at 00:54 UTC
Re: HTML::Parser question by neniro (Priest) on Mar 07, 2004 at 20:55 UTC
If you just want to extract the text from a website, HTML::Strip could be interesting too.

AddOn:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Strip;

    my $hs = HTML::Strip->new();
    print $hs->parse( get('http://www.perlmonks.org/') );
    $hs->eof;
Re: HTML::Parser question by mkurtis (Scribe) on Mar 07, 2004 at 22:48 UTC
I tried your code but noticed that it doesn't get all of Yahoo's content, just the main part with the directory; none of the tabbed box text (where the news is) comes through. I don't understand why it wouldn't extract all the text. However, it is not combining words anymore, thanks for that. There are no more docs about it on CPAN, so I'll try to find others. I'm using your exact code, except I changed perlmonks to yahoo. Thanks