mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I've been using HTML::Parser to get the text off of websites. I've noticed that when I try to get the text off of yahoo, it joins the last word on a line with the first word on the next line.
Example:
"Stewart" followed by "Conservatives" on the next line results in "StewartConservatives".
Does anyone know how to make this not combine the words?
Thanks

Replies are listed 'Best First'.
Re: HTML::Parser question
by matija (Priest) on Mar 07, 2004 at 19:03 UTC
    It's hard to know exactly what's happening without seeing your code. I see two possibilities:
    • First, check the unbroken_text setting.
    • Perhaps yahoo is putting in <br> instead of newlines (yuck!). If you're ignoring HTML codes, you wouldn't see them. Hmmm. I just checked: they're not. And there seems to be whitespace between their <p>, which I think you should be getting in your text handler routine.
    What does your text capture subroutine look like? Are you handling any HTML tags, or just the text?

    (This answer moved from the comment on the node about to be deleted because it was a duplicate)

      Here it is
      #!/usr/bin/perl -w
      package Example;
      use LWP::Simple;
      use HTML::Parser;
      @Example::ISA = qw(HTML::Parser);

      $content = get("http://www.yahoo.com");
      my $parser = Example->new();
      $parser->parse($content);
      print $parser->{TEXT};

      sub text {
          my ($self, $text) = @_;
          $self->{TEXT} .= $text;
      }

      Thanks
        Ah, I think I see what's going on. All that this does is combine the text, while removing the tags. Nothing but removing the tags. That means that if your HTML looks like
        foo<br>bar
        then this will simply remove the "<br>", thereby combining the two pieces of text into one single word "foobar".

        You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" would better be replaced with newlines.

        I've tried the following extension to your code, and it appears to work rather well.

        {
            package Example;
            use HTML::Parser;

            # plain text substitution for those tags that need it:
            my %tagtext = ( p => "\n\n", br => "\n", img => " ");
            @Example::ISA = qw(HTML::Parser);

            sub text {
                my ($self, $text) = @_;
                $self->{TEXT} .= $text;
            }

            sub start {
                my ($self, $tag, $attr, $attrseq, $origtext) = @_;
                defined(my $text = $tagtext{$tag}) or return;
                $self->{TEXT} .= $text;
            }
        }

        use LWP::Simple;
        $content = get("http://www.yahoo.com");
        my $parser = Example->new();
        $parser->parse($content);
        print $parser->{TEXT};

        use strict;


Re: HTML::Parser question
by Juerd (Abbot) on Mar 07, 2004 at 20:18 UTC

    does anyone know how to make this not combine the words?

    Are you sure *it* is combining the words? I think your code is doing that. If your sub gets called multiple times, that is because there were tags in between. You do nothing with those tags, but it is very likely that they were meant to render as some sort of white space.

    For formatting HTML as plain text, have a look at HTML::FormatText, or consider using w3m -dump, links -dump or lynx -dump.

    A quick and ugly fix for your problem would probably be having start and end handlers that add a single space to the string and a substitution on eof to remove duplicate whitespace.
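A minimal sketch of that quick fix, assuming the same subclass-style handlers used elsewhere in this thread. The package name SpacedText and the sample HTML string are made up for illustration, and a literal string stands in for the page fetched with LWP::Simple so the example runs offline:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package SpacedText;    # hypothetical name, for illustration only
use HTML::Parser;
our @ISA = ('HTML::Parser');

# Append every text chunk to the accumulator.
sub text {
    my ($self, $text) = @_;
    $self->{TEXT} .= $text;
}

# Every start or end tag contributes one space,
# so text on either side of a tag can never fuse into one word.
sub start { $_[0]{TEXT} .= ' ' }
sub end   { $_[0]{TEXT} .= ' ' }

package main;

# Assumed sample input, not taken from the real yahoo page.
my $html = '<p>Stewart</p><p>Conservatives</p> result';

my $parser = SpacedText->new;
$parser->parse($html);
$parser->eof;    # flush any text still buffered by the parser

# Once parsing is finished, collapse whitespace runs and trim the ends.
my $text = $parser->{TEXT};
$text =~ s/\s+/ /g;
$text =~ s/^ | $//g;

print "$text\n";
```

This treats every tag alike, including inline ones such as <b> and <i>, so a word that is partly bold would be split in two. A lookup table of block-level tags avoids that at the cost of a little more code.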

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      Thanks Juerd. That quick and ugly fix that you were talking about, would that be putting each word on a separate line? I looked at that HTML::FormatText module, but I think that if the parser just stuck every word on a new line it would work, and all would be well. Do you by chance know how to do this?

      Thanks

        Thanks Juerd. That quick and ugly fix that you were talking about, would that be putting each word on a separate line? I looked at that HTML::FormatText module, but I think that if the parser just stuck every word on a new line it would work, and all would be well.

        That "fix" would do whatever you program it to do. It is not the parser's job to modify anything. It parses and does that well.

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        Looking at the code you posted in an earlier reply, you could change this line (in "sub text {...}")
        $self->{TEXT}.=$text;
        to read as follows:
        $self->{TEXT}.="$text\n";
        I tried your code with this mod, and the result might still not be exactly what you wanted: I saw "nbsp", HTML comments, and other "funny character" entities (©, •, etc.). I think you'll find a way to handle these with HTML::Entities. Also, depending on how far you want to go with filtering the yahoo page content to get rid of irrelevant stuff (like the comments, the scripting, the forms, etc.), you might get good mileage out of HTML::TokeParser or its ::Simple variant (same functionality, different API).
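One possible way to deal with the entities, sketched under the assumption that the text handler keeps the one-chunk-per-line change shown above. decode_entities comes from HTML::Entities; the package name DecodedText and the sample HTML are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package DecodedText;    # hypothetical name, for illustration only
use HTML::Parser;
use HTML::Entities qw(decode_entities);
our @ISA = ('HTML::Parser');

sub text {
    my ($self, $text) = @_;

    # In non-void context decode_entities returns a decoded copy
    # and leaves the original chunk untouched.
    my $decoded = decode_entities($text);

    # &nbsp; decodes to the non-breaking space character (0xA0);
    # map it to an ordinary space for plain-text output.
    $decoded =~ tr/\xA0/ /;

    $self->{TEXT} .= "$decoded\n";
}

package main;

# Assumed sample input, not taken from the real yahoo page.
my $html = '<p>Fish&nbsp;&amp;&nbsp;chips</p><p>1&lt;2</p>';

my $parser = DecodedText->new;
$parser->parse($html);
$parser->eof;

print $parser->{TEXT};
```

This only covers the entities. Script and form content would still need separate filtering, for example with HTML::TokeParser as suggested.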
Re: HTML::Parser question
by neniro (Priest) on Mar 07, 2004 at 20:55 UTC
    If you just want to extract the text from a website, HTML::Strip could be interesting too.

    AddOn:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Strip;

    my $hs = HTML::Strip->new();
    print $hs->parse( get('http://www.perlmonks.org/') );
    $hs->eof;
      I tried your code but noticed that it doesn't get all of yahoo's content, just the main part with the directory, and none of the tabbed box text (where the news is). I don't understand why it wouldn't extract all the text. However, it is not combining words anymore, thanks for that. There are no more docs about it on CPAN; I'll try and find others. I'm using your exact code except I changed perlmonks to yahoo.

      Thanks