Re: HTML::Parser question
by matija (Priest) on Mar 07, 2004 at 19:03 UTC
#!/usr/bin/perl -w
package Example;
use LWP::Simple;
use HTML::Parser;
@Example::ISA = qw(HTML::Parser);

my $content = get("http://www.yahoo.com");
my $parser = Example->new();
$parser->parse($content);
print $parser->{TEXT};

# HTML::Parser calls text() for every chunk of plain text it finds
sub text {
    my ($self, $text) = @_;
    $self->{TEXT} .= $text;
}
Thanks
Ah, I think I see what's going on. All that this does is combine the text while removing the tags. Nothing but removing the tags. That means that if your HTML looks like
foo<br>bar
then this will simply remove the "<br>", thereby combining the two pieces of text into the single word "foobar".
You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" are better replaced with newlines.
I've tried the following extension to your code, and it appears to work rather well.
{
    package Example;
    use HTML::Parser;

    # plain text substitution for those tags that need it:
    my %tagtext = ( p => "\n\n", br => "\n", img => " " );

    @Example::ISA = qw(HTML::Parser);

    sub text {
        my ($self, $text) = @_;
        $self->{TEXT} .= $text;
    }

    # start() is called for every opening tag; substitute whitespace
    # for the tags listed in %tagtext and ignore the rest
    sub start {
        my ($self, $tag, $attr, $attrseq, $origtext) = @_;
        defined(my $text = $tagtext{$tag}) or return;
        $self->{TEXT} .= $text;
    }
}

use LWP::Simple;
my $content = get("http://www.yahoo.com");
my $parser = Example->new();
$parser->parse($content);
print $parser->{TEXT};
Re: HTML::Parser question
by Juerd (Abbot) on Mar 07, 2004 at 20:18 UTC
does anyone know how to make this not combine the words?
Are you sure *it* is combining the words? I think your code is doing that. If your sub gets called multiple times, that is because there were tags in between. You do nothing with those tags, but it is very likely that they were meant to render as some sort of white space.
For formatting HTML as plain text, have a look at HTML::FormatText, or consider using w3m -dump, links -dump or lynx -dump.
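A minimal sketch of the HTML::FormatText route, assuming the usual HTML::TreeBuilder front end; the URL and margins here are just placeholders:

#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder;   # builds the parse tree that the formatter walks
use HTML::FormatText;

my $tree = HTML::TreeBuilder->new_from_content(get('http://www.yahoo.com'));
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
print $formatter->format($tree);
$tree->delete;           # HTML trees are self-referential; free them explicitly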
A quick and ugly fix for your problem would probably be having start and end handlers that add a single space to the string and a substitution on eof to remove duplicate whitespace.
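That fix might look something like the sketch below, reusing the subclassing style from the code earlier in the thread. The get_text() helper is my name, not part of HTML::Parser; it is simply one place to put the whitespace-collapsing substitution:

{
    package SpacedExample;
    use HTML::Parser;
    @SpacedExample::ISA = qw(HTML::Parser);

    sub text  { my ($self, $text) = @_; $self->{TEXT} .= $text; }

    # every start and end tag contributes a single space...
    sub start { $_[0]->{TEXT} .= ' '; }
    sub end   { $_[0]->{TEXT} .= ' '; }

    sub get_text {
        my $self = shift;
        (my $text = $self->{TEXT}) =~ s/\s+/ /g;   # ...and runs of whitespace collapse to one
        return $text;
    }
}

use LWP::Simple;
my $parser = SpacedExample->new();
$parser->parse(get("http://www.yahoo.com"));
print $parser->get_text();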
Thanks Juerd. That quick and ugly fix you were talking about, would that be putting each word on a separate line? I looked at the HTML::FormatText module, but I think that if the parser just stuck every word on a new line it would work, and all would be well. Do you by chance know how to do this?
Thanks
Looking at the code you posted in an earlier reply, you could change this line (in "sub text {...}")
$self->{TEXT}.=$text;
to read as follows:
$self->{TEXT}.="$text\n";
I tried your code with this mod, and the result might still not be exactly what you wanted: I saw "nbsp", HTML comments, and other "funny character" entities (©, etc.). I think you'll find a way to handle these with HTML::Entities. Also, depending on how far you want to go with filtering the yahoo page content to get rid of irrelevant stuff (like the comments, the scripting, the forms, etc.), you might get good mileage out of HTML::TokeParser or its ::Simple variant (same functionality, different API).
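For the entities, a hedged sketch of what that might look like inside the text handler from the code earlier in the thread; decode_entities() is HTML::Entities' standard function, and the trailing newline is the mod suggested above:

use HTML::Entities;   # decode_entities() turns &amp;, &nbsp;, &copy; etc. into plain characters

sub text {
    my ($self, $text) = @_;
    $self->{TEXT} .= decode_entities($text) . "\n";
}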
Re: HTML::Parser question
by neniro (Priest) on Mar 07, 2004 at 20:55 UTC
If you just want to extract the text from a website, HTML::Strip could be interesting too.
AddOn:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::Strip;
my $hs = HTML::Strip->new();
print $hs->parse( get('http://www.perlmonks.org/') );
$hs->eof;   # tell HTML::Strip we are done with this document
I tried your code but noticed that it doesn't get all of Yahoo's content, just the main part with the directory, and none of the tabbed box text (where the news is). I don't understand why it wouldn't extract all the text; however, it is not combining words anymore, thanks for that. There are no more docs about it on CPAN; I'll try and find others. I'm using your exact code, except I changed perlmonks to yahoo.
Thanks