http://qs1969.pair.com?node_id=428015

nejcPirc has asked for the wisdom of the Perl Monks concerning the following question:

Hello perl monks,

Can you explain to me how to filter the text data that I get from my form?

With my CGI script I collect data from my form (a textarea, for example).
Now I have the collected data in $data='abcd efgh<img src="http://test.com/image.gif">ijklmn';

Now I would like to use some filter to strip HTML tags like "<img src="http://test.com/image.gif">" from $data, so that in the end it contains no HTML tags, only text: $data='abcd efgh ijklmn';

Thanks a lot, Nejc

Replies are listed 'Best First'.
Re: How to remove HTML tags from text
by skx (Parson) on Feb 04, 2005 at 12:10 UTC

    You can do better than just replacing the "<" and ">" characters, since that alone does not prevent all attacks.

    Have a look at HTML::Scrubber and HTML::Filter.

    Using the first is as simple as this:

    use strict;
    use warnings;
    use CGI;
    use HTML::Scrubber;

    # Tags that are allowed through.
    my @allow = qw[ ul li ol p br hr b a i pre blockquote tt dl dd dt ];

    # Per-tag rules.
    my @rules = (
        script => 0,
        img    => {
            src => qr{^(http://)}i,    # only absolute image links allowed
            alt => 1,                  # alt attribute allowed
            '*' => 0,                  # deny all other attributes
        },
        a => {
            href  => 1,                # HREF
            title => 1,                # title attribute allowed
            rel   => 1,                # link relationship
            '*'   => 0,                # deny all other attributes
        },
    );

    my @default = (
        0 =>                           # default rule, deny all tags
        {
            '*'           => 1,        # default rule, allow all attributes
            'href'        => qr{^(?!(?:java)?script)}i,
            'src'         => qr{^(?!(?:java)?script)}i,
            'cite'        => '(?i-xsm:^(?!(?:java)?script))',
            'language'    => 0,
            'name'        => 1,        # could be sneaky, but hey ;)
            'onblur'      => 0,
            'onchange'    => 0,
            'onclick'     => 0,
            'ondblclick'  => 0,
            'onerror'     => 0,
            'onfocus'     => 0,
            'onkeydown'   => 0,
            'onkeypress'  => 0,
            'onkeyup'     => 0,
            'onload'      => 0,
            'onmousedown' => 0,
            'onmousemove' => 0,
            'onmouseout'  => 0,
            'onmouseover' => 0,
            'onmouseup'   => 0,
            'onreset'     => 0,
            'onselect'    => 0,
            'onsubmit'    => 0,
            'onunload'    => 0,
            'src'         => 0,        # duplicate key: overrides the 'src' pattern above
            'type'        => 0,
        }
    );

    #
    # Create the scrubber.
    #
    my $safe = HTML::Scrubber->new();
    $safe->allow( @allow );
    $safe->rules( @rules );
    $safe->default( @default );

    # deny HTML comments
    $safe->comment(0);

    #
    # Update each parameter with the cleaned version
    #
    my $form = CGI->new;
    foreach my $p ( $form->param() ) {
        my $val = $form->param($p);
        $val = $safe->scrub( $val );
        $form->param( $p, $val );
    }
    Steve
    ---
    steve.org.uk
      Thanks a lot man:)
Re: How to remove HTML tags from text
by gellyfish (Monsignor) on Feb 04, 2005 at 12:22 UTC

    Personally I would go with HTML::Parser:

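    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my $data = 'abcd efgh<img src="http://test.com/image.gif">ijklmn';

    my $parser = HTML::Parser->new(
        # Append each decoded text chunk to a buffer kept on the parser object.
        text_h           => [ sub { $_[0]->{_data} .= $_[1]; }, "self,dtext" ],
        # Reset the buffer at the start of every document.
        start_document_h => [ sub { $_[0]->{_data} = ''; }, "self" ],
    );

    $parser->parse($data);
    $parser->eof;    # flush any text still buffered at the end of the input

    print $parser->{_data};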

    /J\

      Alternative using HTML::TokeParser:

      use strict;
      use HTML::TokeParser;

      # from file
      my $p = HTML::TokeParser->new("test.html") or die "Can't open: $!";
      # from string
      #my $p = HTML::TokeParser->new(\"text1 <b> text2 </b> text3");

      my $t;
      while (my $token = $p->get_token) {
          # "T" tokens are text; everything else (tags, comments) is skipped.
          $t .= $token->[1] if $token->[0] eq "T";
      }
      print $t;

      holli, regexed monk
Re: How to remove HTML tags from text
by pelagic (Priest) on Feb 04, 2005 at 11:48 UTC
    One of many possibilities is to remove whatever is between "<" and ">", whether or not it is really HTML. This can be done easily with Text::Balanced's "extract_bracketed".

    pelagic
      Something like this?
      $data =~ s/<.*?>//g;
      $data =~ s/&lt.*?&gt//g;
        Text::Balanced does it properly even if there is some funny stuff between the brackets. Look into the documentation for more info. But of course you can also use some regex to remove the strings in a simple way.
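
        A minimal sketch of that idea, assuming the goal is simply to drop every balanced "<...>" span. The helper name strip_bracketed is made up for illustration and is not from this thread. The '<">' delimiter string asks extract_bracketed to ignore brackets inside double-quoted strings, and the third argument is the prefix to skip before the opening bracket.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: remove every balanced <...> span from a string.
sub strip_bracketed {
    my ($text) = @_;
    my $out = '';
    while ( length $text ) {
        # In list context the call returns (extracted, remainder, skipped prefix).
        # '[^<]*' skips ordinary text up to the next '<'.
        my ( $tag, $rest, $prefix ) = extract_bracketed( $text, '<">', '[^<]*' );
        if ( defined $tag && length $tag ) {
            $out .= $prefix;    # keep the text before the tag, drop the tag
            $text = $rest;
        }
        else {
            # No balanced bracket left (or a stray '<'): keep the rest as is.
            $out .= $text;
            last;
        }
    }
    return $out;
}

print strip_bracketed('abcd efgh<img src="http://test.com/image.gif">ijklmn'), "\n";
```

        Note the limits of this sketch: an unbalanced "<" stops the stripping and the rest of the input is kept untouched, and single-quoted attribute values are not treated specially. It removes bracketed spans; it does not understand HTML.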

        pelagic
        nope.
        $data = <<_END_;
        <script language="JavaScript">
        alert("Boo!")
        </script
        >
        _END_
        A /s modifier would help a little, but doesn't solve everything. Eg, what about this tag?
        $data = qq{ <img src="hello.jpg" alt="x>y" width="60" height="60" /> };

        Lesson: never treat HTML parsing as a "back of an envelope" exercise :)
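
        To make the failure concrete, here is a small sketch that runs the regex from above (with /s added) over the img tag. The helper name naive_strip is made up for this illustration. The non-greedy match stops at the first ">", which sits inside the alt attribute, so the tail of the tag survives.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex approach discussed above.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//gs;    # /s lets '.' match newlines inside a tag
    return $html;
}

my $img  = qq{<img src="hello.jpg" alt="x>y" width="60" height="60" />};
my $left = naive_strip($img);

# The match ended at the '>' inside alt="x>y", so part of the tag is left over.
print "left over: [$left]\n";    # left over: [y" width="60" height="60" />]
```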

        cLive ;-)

Re: How to remove HTML tags from text
by xdg (Monsignor) on Feb 04, 2005 at 16:18 UTC

    Not surprisingly, CPAN has a module for this: HTML::Strip

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.