Strip text from HTML

A short HTML::Parser hack that extracts all the text content from an HTML document, discarding the markup.

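package TextStrip;
use strict;
use warnings;
use base 'HTML::Parser';

my $strip_text = '';    # accumulates the text chunks as they are parsed

# v2-style subclass interface: text() is called with ($self, $text) for each run of text
sub text { $strip_text .= $_[1] }

my $parser = TextStrip->new;
my $fh = *DATA; # open $fh onto DATA for demo
$parser->parse_file($fh) && print $strip_text;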

__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Index</title>
</head>
<body>
<h1>Hello World</h1>
<p>Just Another
<p>Parser Hack
</body>
</html>

Re: Strip text from HTML by briac (Sexton) on Oct 02, 2001 at 04:12 UTC
Nice one, here's how to do it using the HTML::Parser v3 interface:

#!/usr/bin/perl -w
use strict;
use HTML::Parser 3;

my $parser = HTML::Parser->new(
    text_h => [ sub { print shift }, 'dtext' ]
)->parse_file(*DATA);

__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Index</title>
</head>
<body>
<h1>Hello World</h1>
<p>Just Another
<p>Parser Hack
</body>
</html>

Cheers, briac
Re: Re: Strip text from HTML by tachyon (Chancellor) on Oct 02, 2001 at 05:52 UTC
Now that is a brief hack! I've got used to the v2 interface because it is so simple, although the code always seems a little gawky. You've inspired me to have another go at learning the version 3 interface.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
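
For comparison with the original post, here is a minimal sketch of the version 3 interface that collects the text into a variable instead of printing it, much as $strip_text does in the v2 subclass. The helper name html_to_text is hypothetical and not from either post; the sketch assumes HTML::Parser 3 is installed and that the input is a string rather than a filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3;

# Hypothetical helper (name is an assumption, not from the thread):
# returns the text content of an HTML string with the markup removed.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with HTML entities decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed

    return $text;
}

print html_to_text("<h1>Hello World</h1>\n<p>Just Another\n<p>Parser Hack\n");
```

Returning a string keeps the parsing separate from the output, so the caller can trim whitespace or write the result to a file as needed.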