in reply to HTML::Parser question

It's hard to know exactly what's happening without seeing your code. Two questions: What does your text-capture subroutine look like? Are you handling any HTML tags, or just the text?

(This answer was moved from a comment on a node that is about to be deleted because it was a duplicate.)

HTML::Parser question
by mkurtis (Scribe) on Mar 07, 2004 at 19:17 UTC
    Here it is
    #!/usr/bin/perl -w
    package Example;
    use LWP::Simple;
    use HTML::Parser;
    @Example::ISA = qw(HTML::Parser);

    $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    print $parser->{TEXT};

    sub text {
        my ($self, $text) = @_;
        $self->{TEXT} .= $text;
    }

    Thanks
      Ah, I think I see what's going on. All this does is concatenate the text while removing the tags, and nothing else. That means that if your HTML looks like
      foo<br>bar
      then this will simply remove the "<br>", thereby combining the two pieces of text into the single word "foobar".
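The effect is easy to reproduce without fetching a page. The following is only a minimal sketch: the package name TextOnly is made up for illustration, and it relies on the same old-style subclassing as the code above, where the parser calls a method named text for each run of text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    # Hypothetical package name, used only for this illustration.
    package TextOnly;
    use HTML::Parser;
    our @ISA = qw(HTML::Parser);

    # Called for every run of text; tags are never seen here.
    sub text {
        my ($self, $text) = @_;
        $self->{TEXT} .= $text;
    }
}

my $parser = TextOnly->new;
$parser->parse("foo<br>bar");
$parser->eof;                    # flush any text still buffered
print $parser->{TEXT}, "\n";     # prints "foobar"
```

Since nothing is emitted for the "<br>", the two text runs end up adjacent in the output.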

      You could, nay should, provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" are better replaced with newlines.

      I've tried the following extension to your code, and it appears to work rather well.

      {
          package Example;
          use HTML::Parser;
          # plain text substitution for those tags that need it:
          my %tagtext = ( p => "\n\n", br => "\n", img => " " );
          @Example::ISA = qw(HTML::Parser);
          sub text {
              my($self, $text) = @_;
              $self->{TEXT} .= $text;
          }
          sub start {
              my($self, $tag, $attr, $attrseq, $origtext) = @_;
              defined(my $text = $tagtext{$tag}) or return;
              $self->{TEXT} .= $text;
          }
      }
      use LWP::Simple;
      $content = get("http://www.yahoo.com");
      my $parser = Example->new();
      $parser->parse($content);
      print $parser->{TEXT};
        I tried your code as well, bart, but it still combines some words, and now has "&" and "&nbsp" between them. I don't know whether this is because the non-breaking spaces are not tags, so they aren't removed, but how would I remove them and the "&"? Yahoo's clock code also shows up in the parser output; I'll see if it's within tags as well.

        Thanks for your help.
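One possible way to deal with both leftovers, offered as a sketch rather than a tested fix for the Yahoo page. The text method receives the raw source text, so entities such as "&amp;" and "&nbsp;" arrive undecoded; decode_entities from HTML::Entities (which ships with HTML::Parser) turns them into characters. The contents of "<script>" and "<style>" elements are also reported as text, so the sketch counts those start and end tags and ignores text while inside them. The helper html_to_text, the SKIP key, and the %skip table are names invented here, not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    package Example;
    use HTML::Parser;
    use HTML::Entities qw(decode_entities);
    our @ISA = qw(HTML::Parser);

    # plain text substitution for those tags that need it:
    my %tagtext = ( p => "\n\n", br => "\n", img => " " );
    # elements whose contents should not reach the output (assumed list)
    my %skip = map { $_ => 1 } qw(script style);

    sub start {
        my ($self, $tag) = @_;
        $self->{SKIP}++ if $skip{$tag};
        defined(my $text = $tagtext{$tag}) or return;
        $self->{TEXT} .= $text;
    }

    sub end {
        my ($self, $tag) = @_;
        $self->{SKIP}-- if $skip{$tag} && $self->{SKIP};
    }

    sub text {
        my ($self, $text) = @_;
        return if $self->{SKIP};         # inside <script> or <style>
        $text = decode_entities($text);  # "&amp;" becomes "&", "&nbsp;" becomes chr(160)
        $text =~ tr/\xA0/ /;             # non-breaking space to ordinary space
        $self->{TEXT} .= $text;
    }
}

# Hypothetical helper: parse a string and return the collected text.
sub html_to_text {
    my ($html) = @_;
    my $parser = Example->new;
    $parser->parse($html);
    $parser->eof;                        # flush any buffered text
    return defined $parser->{TEXT} ? $parser->{TEXT} : '';
}

print html_to_text(
    '<p>Fish &amp;&nbsp;chips<br>now<script>var t = new Date();</script></p>'
), "\n";
```

To run it against a live page, pass the string returned by LWP::Simple's get to html_to_text instead of the literal above.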

      use strict;

      PN5

        Thanks PN5, however that doesn't make the script run any differently; it just makes me change $content to my $content.
        Thanks