Re: Remove HTML tags from document

You could use HTML::TokeParser::Simple and print only the text tokens.

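# Almost straight from the HTML::TokeParser::Simple POD

use strict;
use warnings;
use HTML::TokeParser::Simple;

# Take the HTML file to strip from the command line
my $somefile = shift or die "usage: $0 file.html\n";
my $p = HTML::TokeParser::Simple->new( $somefile );

while ( my $token = $p->get_token ) {
     # Print only text tokens; tags, comments, and declarations are skipped
     print $token->as_is if $token->is_text;
}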

HTH


Re: Re: Remove HTML tags from document by matth (Monk) on Aug 04, 2003 at 09:18 UTC
This works nicely. Is there an easy adaptation that would let me keep the spacing that is in the HTML document?
Re: Re: Re: Remove HTML tags from document by pzbagel (Chaplain) on Aug 04, 2003 at 09:47 UTC
I'm not sure I understand. As far as I recall, HTML::TokeParser::Simple does preserve newlines in the text. I ran the code quickly to make sure, and it does keep the newlines that are in the HTML. Do you have tags that span multiple lines? What exactly is happening?
Re: Re: Re: Re: Remove HTML tags from document by matth (Monk) on Aug 04, 2003 at 10:04 UTC
I have tables where I would like to maintain the tabs.
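
One possible way to keep table columns apart is sketched below. It is not taken from the POD, and it assumes that "tabs" here means a tab between cells and a newline after each row. The helper name html_to_tabbed_text is made up for illustration; the only module calls it relies on are get_token, is_text, is_start_tag, is_end_tag and as_is. It also assumes every row is closed with an explicit </tr>.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not part of the module): strip tags, but put a tab
# between table cells and a newline after each table row.
sub html_to_tabbed_text {
    my ($html) = @_;

    # A reference to a string is parsed as HTML content, not as a filename
    my $p = HTML::TokeParser::Simple->new( \$html );

    my $out        = '';
    my $first_cell = 1;    # true until the first cell of the current row is seen

    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            $out .= $token->as_is;
        }
        elsif ( $token->is_start_tag('tr') ) {
            $first_cell = 1;
        }
        elsif ( $token->is_start_tag('td') || $token->is_start_tag('th') ) {
            # Tab before every cell except the first one in the row
            $out .= "\t" unless $first_cell;
            $first_cell = 0;
        }
        elsif ( $token->is_end_tag('tr') ) {
            # Assumes rows are closed explicitly with </tr>
            $out .= "\n";
        }
    }
    return $out;
}

print html_to_tabbed_text(
    '<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>'
);
```

Any whitespace between the tags in the real document is still passed through as text, so indented HTML source may need an extra cleanup pass on the output.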