in reply to Removing HTML Tags from a file

You mean, instead of deleting tags, you want to replace them with strings of spaces, so that the output file is the same size as the input, and just has a lot more space characters (and no html tags) -- have I got that right?

This would do:

#!/usr/bin/perl
use strict;
use HTML::TokeParser::Simple;

my $htm = HTML::TokeParser::Simple->new( $ARGV[0] ) or die "oops: $!";

while ( my $token = $htm->get_token ) {
    if ( $token->is_tag() ) {
        # overwrite the tag with an equal number of spaces
        print " " x length( $token->as_is );
    }
    else {
        # everything else passes through unchanged
        print $token->as_is;
    }
}

If you look at the perldoc for HTML::TokeParser::Simple (and for the "less simple" classes it is derived from), you should find it easy to come up with other, more useful variants, and to figure out handy ways of dealing with things like scripts and comments, which are often included in html files.
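
As one possible variant along those lines, here is a rough sketch that also blanks out comments and whatever sits inside script and style elements. The helper names blank and blank_markup are made up for this example, and keeping newlines intact (so line numbers still match the input) is my own choice rather than something the original question asked for. It relies only on the is_start_tag, is_end_tag, is_tag, is_comment and as_is methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Replace every character except newlines with a space, so both the
# length and the line structure of the text are preserved.
sub blank {
    my ($text) = @_;
    $text =~ s/[^\n]/ /g;
    return $text;
}

# Hypothetical helper: return $html with tags, comments, and the
# contents of <script>/<style> elements overwritten by spaces.
sub blank_markup {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( \$html )
        or die "could not create parser";

    my $out     = '';
    my $in_code = 0;    # true while inside <script> or <style>

    while ( my $token = $parser->get_token ) {
        my $raw = $token->as_is;

        if ( $token->is_start_tag('script') || $token->is_start_tag('style') ) {
            $in_code = 1;
        }
        elsif ( $token->is_end_tag('script') || $token->is_end_tag('style') ) {
            $in_code = 0;
        }

        if ( $in_code || $token->is_tag || $token->is_comment ) {
            $out .= blank($raw);
        }
        else {
            $out .= $raw;
        }
    }
    return $out;
}

# When a file name is given on the command line, process that file.
if (@ARGV) {
    local $/;             # slurp mode: read the whole file at once
    my $html = <>;
    print blank_markup($html);
}
```

Saved under a name of your choosing, it would be run as "perl yourscript.pl input.html > output.txt". Doctype declarations are left alone here; if those should go too, the is_declaration method could be added to the same test.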

Re^2: Removing HTML Tags from a file
by agynr (Acolyte) on Dec 14, 2004 at 06:50 UTC
    Sir, I am not able to install the HTML::TokeParser::Simple module. I am currently using ActiveState Perl on Windows XP. I have the simple.pm file at d:\perl\lib\simple.pm. Please tell me how to get it installed.
      Yes, you can.

      Open a "DOS" window and type:

      C:\Documents and Settings\Administrador>ppm

      You'll enter PPM's prompt. Then type:

      ppm> install HTML-TokeParser-Simple
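
To check afterwards that the module can really be loaded, a small script along these lines may help. It is only a sketch: module_available is a hypothetical helper name, and the check uses nothing beyond core Perl. Note that perl looks for HTML/TokeParser/Simple.pm underneath one of the directories in @INC, so a file sitting directly at lib\simple.pm would not be found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded from @INC.
sub module_available {
    my ($module) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;    # HTML::Foo -> HTML/Foo.pm
    return eval { require $file; 1 } ? 1 : 0;
}

my $name = 'HTML::TokeParser::Simple';
print "$name is ",
    ( module_available($name) ? "installed" : "NOT installed" ), "\n";
```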

      Regards,
        Thanks for that, I have got it installed.

      agynr,

      Go through the ActiveState documentation.

      It describes how to install a module using ppm.

      Follow that procedure to install it.

      Prasad