in reply to HTML::Parser question

Ah, I think I see what's going on. All this does is concatenate the text while removing the tags, and nothing else. That means that if your HTML looks like
foo<br>bar
then this will simply remove the "<br>", thereby combining the two pieces of text into the single word "foobar".

You could, nay, should provide a way to replace significant tags with significant whitespace. For example, "<i>" and "<b>" tags can just go, but "<p>" and "<br>" would better be replaced with newlines.

I've tried the following extension to your code, and it appears to work rather well.

    {
        package Example;
        use HTML::Parser;

        # plain text substitution for those tags that need it:
        my %tagtext = ( p => "\n\n", br => "\n", img => " ");

        @Example::ISA = qw(HTML::Parser);

        # called for every chunk of text outside of tags
        sub text {
            my($self, $text) = @_;
            $self->{TEXT} .= $text;
        }

        # called for every start tag; $tag is the lowercased tag name
        sub start {
            my($self, $tag, $attr, $attrseq, $origtext) = @_;
            defined(my $text = $tagtext{$tag}) or return;
            $self->{TEXT} .= $text;
        }
    }

    use LWP::Simple;
    $content = get("http://www.yahoo.com");
    my $parser = Example->new();
    $parser->parse($content);
    $parser->eof;    # flush any text the parser is still buffering
    print $parser->{TEXT};
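
To check the idea without fetching a live page, the same substitution can be tried on the "foo<br>bar" snippet from above. This is only a minimal sketch: it uses HTML::Parser's handler-style (version 3) interface rather than the subclass shown above, and the function name strip_tags is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Tags that should leave whitespace behind when they are removed
# (same table as in the code above).
my %tagtext = ( p => "\n\n", br => "\n", img => " " );

# strip_tags is a hypothetical helper name, not part of HTML::Parser.
sub strip_tags {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # collect every chunk of text as-is
        text_h  => [ sub { $out .= $_[0] }, 'text' ],
        # emit the replacement whitespace for the tags listed in %tagtext
        start_h => [ sub {
            my ($tag) = @_;
            $out .= $tagtext{$tag} if exists $tagtext{$tag};
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;    # flush buffered text
    return $out;
}

print strip_tags('foo<br>bar <b>baz</b>'), "\n";
```

With this, "foo" and "bar" end up on separate lines instead of being glued together, while the "<b>" tags simply disappear.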

Re: HTML::Parser question
by mkurtis (Scribe) on Mar 07, 2004 at 23:13 UTC
    I tried your code as well, bart, but it still combines some words, and now has & and &nbsp; between them. I don't know whether this is because the non-breaking spaces are not tags, so they aren't removed, but how would I remove them and the &? Yahoo's clock code also shows up in the parser output; I'll see if it's within tags as well.

    Thanks for your help.
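
One possible way to handle both issues raised here, sketched under some assumptions: entities such as &nbsp; are not tags, so they have to be decoded rather than stripped, and HTML::Parser can hand decoded text to the handler through its "dtext" argspec. Script content arrives as ordinary text events, so it can be skipped by tracking whether the parser is currently inside a "<script>" or "<style>" element. The function name html_to_plain_text, the %skip table, and the choice to turn non-breaking spaces into ordinary spaces are illustrative, not something from the posts above.

```perl
use strict;
use warnings;
use HTML::Parser;

my %tagtext = ( p => "\n\n", br => "\n", img => " " );
my %skip    = ( script => 1, style => 1 );   # elements whose content is dropped

# html_to_plain_text is a hypothetical helper name.
sub html_to_plain_text {
    my ($html) = @_;
    my $out      = '';
    my $skipping = 0;    # > 0 while inside a <script> or <style> element
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            $skipping++ if $skip{$tag};
            $out .= $tagtext{$tag} if exists $tagtext{$tag};
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $skipping-- if $skip{$tag} && $skipping > 0;
        }, 'tagname' ],
        # 'dtext' is like 'text', but with entities already decoded
        text_h => [ sub {
            my ($text) = @_;
            return if $skipping;
            $text =~ tr/\xA0/ /;    # non-breaking space -> plain space
            $out .= $text;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print html_to_plain_text(
    'Tom &amp; Jerry&nbsp;show<script>var t = 1;</script><br>next'
), "\n";
```

Here "&amp;" comes out as a plain "&", the non-breaking space becomes a normal space, and the JavaScript between the script tags never reaches the output.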

      You really need to learn Perl before using it. If you use HTML::Parser, first learn how HTML works and then how HTML::Parser works (all after learning Perl).

      Quoth Scott Walters:

      Perl programming requires three skills:

      • Knowledge of the syntax and features of the core language. The Beginning Perl thing is the path to that.
      • CPAN. Anything more complex than a few lines of Perl, go running to http://search.cpan.org/ and search. There is probably a module to do what you want to do.
      • Critical thinking. You're on your own there.

      No language can get rid of the need for critical thinking, though many languages downplay the importance of it, or even scoff at it.

      You have the CPAN thing figured out, but lack critical thinking and knowledge of the language.

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        I'm working on it, Juerd. I've got my learn Perl in 24 hours (not possible) book and have been trying my best since Feb 19th. Anyway, I thought perhaps you would have an answer to my parser question. I think I might go with HTML-Strip instead, as it parses better, but it doesn't parse the whole site. Maybe you know why; as of now I haven't got any replies about that.

        Thanks