Re: Removing HTML Tags from a file

You mean that, instead of deleting tags, you want to replace them with strings of spaces, so that the output file is the same size as the input and simply has a lot more space characters (and no HTML tags) -- have I got that right?

This would do:

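#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser::Simple;

# Expect the HTML file name as the only argument.
@ARGV == 1 or die "Usage: $0 file.html\n";

my $htm = HTML::TokeParser::Simple->new( $ARGV[0] )
    or die "Can't open $ARGV[0]: $!";

while ( my $token = $htm->get_token )
{
    if ( $token->is_tag() ) {
        # Replace each start or end tag with the same number of spaces.
        print " " x length( $token->as_is );
    } else {
        # Everything else (text, comments, etc.) passes through unchanged.
        print $token->as_is;
    }
}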

If you read the perldoc for HTML::TokeParser::Simple (and for the "less simple" classes it is derived from), you might find it easy to come up with other, more useful variants, and to figure out handy ways to deal with things like scripts and comments, which are often included in HTML files.
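As one possible variant along those lines, here is a rough sketch that also blanks out comments and the contents of script and style elements, and leaves newlines alone so that line numbers in the output still match the input. The helper name blank_markup and the sample string are made up for illustration, and parsing from a string reference instead of a file name is just my choice to keep the example self-contained. Treat it as a starting point rather than a finished tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# blank_markup() is a hypothetical helper name, not part of the module.
# It returns a copy of the HTML in which tags, comments, and the bodies
# of <script>/<style> elements are replaced by spaces. Newlines are kept.
sub blank_markup {
    my ($html) = @_;

    # A scalar reference makes the parser read from the string itself.
    my $parser = HTML::TokeParser::Simple->new( \$html )
        or die "Can't parse input: $!";

    my $out     = '';
    my $in_code = 0;    # true while inside <script> or <style>

    while ( my $token = $parser->get_token ) {
        my $raw = $token->as_is;

        if ( $token->is_start_tag('script') || $token->is_start_tag('style') ) {
            $in_code = 1;
        }
        elsif ( $token->is_end_tag('script') || $token->is_end_tag('style') ) {
            $in_code = 0;
        }

        if ( $in_code || $token->is_tag || $token->is_comment ) {
            # Turn every character except newline into a space.
            $raw =~ tr/\n/ /c;
        }
        $out .= $raw;
    }
    return $out;
}

# Made-up sample input, only to show the call.
my $sample = qq{<p>Hi<!-- note --></p>\n<script>var x = 1;</script>\n};
print blank_markup($sample);
```

The output has the same length as the input, which is the property the original script was after. Declarations and processing instructions still pass through untouched in this sketch; the token classes have tests for those as well if you want to blank them too.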


Re^2: Removing HTML Tags from a file by agynr (Acolyte) on Dec 14, 2004 at 06:50 UTC
Sir, I am not able to install the HTML::TokeParser::Simple module. I am currently using ActiveState Perl on Windows XP. I have a Simple.pm file at d:\perl\lib\simple.pm. Please tell me how to get the module installed.
Re^3: Removing HTML Tags from a file by DaWolf (Curate) on Dec 14, 2004 at 06:56 UTC
Yes, you can. Open a "DOS" window and type:
    C:\Documents and Settings\Administrador>ppm
You'll enter PPM's prompt. Then type:
    ppm> install HTML-TokeParser-Simple
Regards,
Er Galvão Abbott
www.galvao.eti.br
Porto Alegre Perl Mongers
Re^4: Removing HTML Tags from a file by agynr (Acolyte) on Dec 14, 2004 at 07:08 UTC
Thanks for that. I have got it installed.
Re^3: Removing HTML Tags from a file by prasadbabu (Prior) on Dec 14, 2004 at 07:06 UTC
agynr, go through the ActiveState documentation. It describes how to install a module using ppm. Follow that procedure to install it.
Prasad
Re^3: Removing HTML Tags from a file by Anonymous Monk on Dec 15, 2004 at 02:25 UTC
A Guide to Installing Modules