A short HTML::Parser hack that strips out (extracts) all the text content of an HTML document, discarding the markup.
package TextStrip;
use strict;
my $strip_text;
use base 'HTML::Parser';

sub text { $strip_text .= $_[1] }

my $parser = new TextStrip;
my $fh = *DATA;    # open $fh onto DATA for demo
$parser->parse_file($fh) && print $strip_text;

__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Index</title>
</head>
<body>
<h1>Hello World</h1>
<p>Just Another
<p>Parser Hack
</body>
</html>

Re: Strip text from HTML
by briac (Sexton) on Oct 02, 2001 at 04:12 UTC

    Nice one. Here's how to do it using the HTML::Parser version 3 interface:

    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser 3;

    my $parser = HTML::Parser->new(
        text_h => [ sub { print shift }, 'dtext' ]
    )->parse_file(*DATA);

    __DATA__
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <title>Index</title>
    </head>
    <body>
    <h1>Hello World</h1>
    <p>Just Another
    <p>Parser Hack
    </body>
    </html>
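
    If you would rather collect the text into a string, as the original post does with $strip_text, something along these lines should work with the version 3 interface. This is only a rough sketch: the extract_text name is made up for illustration, and it parses a string with parse/eof rather than reading a filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3;

# extract_text is a hypothetical helper name, not part of HTML::Parser.
# It returns the decoded text content of an HTML string.
sub extract_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler text with entities already decoded
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed

    return $text;
}

print extract_text("<h1>Hello World</h1><p>Just Another <p>Parser Hack"), "\n";
```

    The closure appends each chunk to a lexical, so nothing is printed until the whole document has been parsed, and no package-level variable is needed.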

    Cheers,
    briac

      Now that is a brief hack! I've got used to the v2 interface because it is so simple, although the code always seems a little gawky. You've inspired me to have another go at learning the version 3 interface.

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print