How to remove HTML tags from text

nejcPirc has asked for the wisdom of the Perl Monks concerning the following question:

Hello perl monks,

Can you explain to me how to filter the text data I get from my form?

With my CGI script I collect data from my form (a textarea, for example).
Now I have the collected data in my $data='abcd efgh<img src="http://test.com/image.gif">ijklmn';

Now I would like to use some filter to strip HTML tags like "<img src="http://test.com/image.gif">" from my $data, so that in the end it contains no HTML tags, only text: $data='abcd efgh ijklmn';

Thanks a lot, Nejc


Re: How to remove HTML tags from text by skx (Parson) on Feb 04, 2005 at 12:10 UTC
You can do better than just replacing the "<" and ">" characters, as that does not prevent all attacks. Have a look at HTML::Scrubber and HTML::Filter. Using the first is as simple as this:

```perl
use HTML::Scrubber;
use CGI;

# Tags that are let through.
my @allow = qw[ ul li ol p br hr b a i pre blockquote tt dl dd dt ];

# Per-tag rules.
my @rules = (
    script => 0,
    img    => {
        src => qr{^(http://)}i,    # only absolute image links allowed
        alt => 1,                  # alt attribute allowed
        '*' => 0,                  # deny all other attributes
    },
    a => {
        href  => 1,                # HREF
        title => 1,                # title attribute allowed
        rel   => 1,                # Link relationship
        '*'   => 0,                # deny all other attributes
    },
);

#
my @default = (
    0 =>    # default rule, deny all tags
    {
        '*'           => 1,    # default rule, allow all attributes
        'href'        => qr{^(?!(?:java)?script)}i,
        'src'         => qr{^(?!(?:java)?script)}i,
        'cite'        => '(?i-xsm:^(?!(?:java)?script))',
        'language'    => 0,
        'name'        => 1,    # could be sneaky, but hey ;)
        'onblur'      => 0,
        'onchange'    => 0,
        'onclick'     => 0,
        'ondblclick'  => 0,
        'onerror'     => 0,
        'onfocus'     => 0,
        'onkeydown'   => 0,
        'onkeypress'  => 0,
        'onkeyup'     => 0,
        'onload'      => 0,
        'onmousedown' => 0,
        'onmousemove' => 0,
        'onmouseout'  => 0,
        'onmouseover' => 0,
        'onmouseup'   => 0,
        'onreset'     => 0,
        'onselect'    => 0,
        'onsubmit'    => 0,
        'onunload'    => 0,
        'src'         => 0,    # duplicate key: this later entry replaces the 'src' pattern above
        'type'        => 0,
    }
);

#
# Create the scrubber.
#
my $safe = HTML::Scrubber->new();
$safe->allow( @allow );
$safe->rules( @rules );
$safe->default( @default );

# deny HTML Comments
$safe->comment(0);

#
# Update each parameter with the cleaned version
#
my $form = CGI->new;
foreach my $p ( $form->param() ) {
    my $val = $form->param($p);
    $val = $safe->scrub( $val );
    $form->param( $p, $val );
}
```

Steve
---
steve.org.uk
Re^2: How to remove HTML tags from text by nejcPirc (Acolyte) on Feb 04, 2005 at 12:16 UTC
Thanks a lot man :)
Re: How to remove HTML tags from text by gellyfish (Monsignor) on Feb 04, 2005 at 12:22 UTC
Personally I would go with HTML::Parser:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $data = 'abcd efgh<img src="http://test.com/image.gif">ijklmn';

my $parser = HTML::Parser->new(
    # Append each decoded text chunk to a buffer kept on the parser object.
    text_h           => [ sub { $_[0]->{_data} .= $_[1]; }, "self,dtext" ],
    # Start every document with an empty buffer.
    start_document_h => [ sub { $_[0]->{_data} = ''; }, "self" ],
);

$parser->parse($data);
$parser->eof;    # signal end of input so any buffered trailing text is flushed

print $parser->{_data};
```

/J\
Re^2: How to remove HTML tags from text by holli (Abbot) on Feb 04, 2005 at 13:01 UTC
Alternative using HTML::TokeParser:

```perl
use strict;
use HTML::TokeParser;

# from file
my $p = HTML::TokeParser->new("test.html") or die "Can't open: $!";

# from string
# my $p = HTML::TokeParser->new(\"text1 <b> text2 </b> text3");

my $t = '';
while (my $token = $p->get_token) {
    # "T" marks a text token; its text is in $token->[1]
    $t .= $token->[1] if $token->[0] eq "T";
}
print $t;
```

holli, regexed monk
Re: How to remove HTML tags from text by pelagic (Priest) on Feb 04, 2005 at 11:48 UTC
One of many possibilities is to remove whatever is between "<" and ">", regardless of whether it is really HTML. This could be achieved easily with Text::Balanced's "extract_bracketed".

pelagic
Re^2: How to remove HTML tags from text by nejcPirc (Acolyte) on Feb 04, 2005 at 11:54 UTC
Something like this?

```perl
$data=~s/<.*?>//g;
$data=~s/&lt.*?&gt//g;
```
Re^3: How to remove HTML tags from text by pelagic (Priest) on Feb 04, 2005 at 12:17 UTC
Text::Balanced does it properly even if there is some funny stuff between the brackets. Look into the documentation for more info. But of course you can also use some regex to remove the strings in a simple way.

pelagic
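For readers who want to see the Text::Balanced idea in code, here is a minimal sketch that was not posted in the thread. It assumes extract_bracketed is called with the delimiter string '<">' (so a ">" inside double quotes does not end the tag) and with the prefix pattern '[^<]*' to skip ahead to the next "<". The helper name strip_bracketed is made up for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: drop every balanced <...> span from a string.
sub strip_bracketed {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        # In list context: (extracted span, remainder, skipped prefix).
        my ($tag, $rest, $prefix) = extract_bracketed($text, '<">', '[^<]*');
        if (!defined $tag) {
            # No further balanced <...> span: keep whatever is left.
            $out .= $text;
            last;
        }
        $out .= $prefix;    # text before the tag is kept
        $text = $rest;      # carry on after the tag
    }
    return $out;
}

print strip_bracketed('abcd efgh<img src="http://test.com/image.gif">ijklmn'), "\n";
```

This only removes the bracketed spans. It does not decode entities or drop the contents of script elements, so a real HTML parser is still the safer choice for untrusted input.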
Re^3: How to remove HTML tags from text by cLive ;-) (Prior) on Feb 05, 2005 at 00:00 UTC
Nope.

```perl
$data = <<_END_;
<script language="JavaScript">
alert("Boo!")
</script
>
_END_
```

A /s modifier would help a little, but doesn't solve everything. E.g., what about this tag?

```perl
$data = qq{ <img src="hello.jpg" alt="x>y" width="60" height="60" /> };
```

Lesson: never treat HTML parsing as a "back of an envelope" exercise :)

cLive ;-)
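To make the two objections concrete, the following sketch (an illustration added here, using a made-up multi-line sample string and a hypothetical helper called naive_strip) runs the simple regex over a tag that spans two lines and over an img tag with a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: the naive regex approach, with /s switched on or off.
sub naive_strip {
    my ($html, $dot_matches_newline) = @_;
    if ($dot_matches_newline) {
        $html =~ s/<.*?>//gs;    # "." also matches newlines
    }
    else {
        $html =~ s/<.*?>//g;     # "." stops at a newline
    }
    return $html;
}

my $multiline = "<b\nclass=\"x\">bold</b>";
my $quoted    = '<img src="hello.jpg" alt="x>y" width="60" height="60" />';

print naive_strip($multiline, 0), "\n";    # the two-line opening tag survives
print naive_strip($multiline, 1), "\n";    # prints: bold
print naive_strip($quoted, 1), "\n";       # prints: y" width="60" height="60" />
```

Even with /s, the match ends at the first ">" it finds, so the quoted ">" in the alt attribute cuts the tag in half and the rest leaks into the output.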
Re: How to remove HTML tags from text by xdg (Monsignor) on Feb 04, 2005 at 16:18 UTC
Not surprisingly, CPAN has a module for this: HTML::Strip

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
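A minimal usage sketch for HTML::Strip, applied to the string from the original question. This was not part of the reply above and assumes the module is installed from CPAN. It uses only the new, parse and eof methods.

```perl
use strict;
use warnings;
use HTML::Strip;

my $data = 'abcd efgh<img src="http://test.com/image.gif">ijklmn';

my $hs    = HTML::Strip->new();
my $clean = $hs->parse($data);    # returns the text with the markup removed
$hs->eof;                         # reset the parser state before reusing $hs

print "$clean\n";
```

If the same object is reused for several form fields, calling eof between them keeps state from one field leaking into the next.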
