Removing HTML tags from a string

by Baz (Friar)
on Nov 17, 2001 at 06:27 UTC ( [id://125981]=perlquestion )

Baz has asked for the wisdom of the Perl Monks concerning the following question:

Some smart ass (not disclosing any names, kwoff) recently demonstrated to me the dangers of forms. Suppose someone decided to enter their name as: WHO LET THE DOGS OUT ! Is there any method available in Perl (maybe CGI.pm) that could ensure that the HTML tags are removed, so that text is only displayed in the format I decide it should be displayed in? Thanks.

Replies are listed 'Best First'.
Re: Removing HTML tags from a string
by chip (Curate) on Nov 17, 2001 at 08:02 UTC
    Rather than removing the tags, you could easily use HTML entity encoding to make them appear literally as the user typed them:

    use HTML::Entities;
    print encode_entities("<b>hi</b>");
    # prints: &lt;b&gt;hi&lt;/b&gt;

        -- Chip Salzenberg, Free-Floating Agent of Chaos

      wouldn't just a simple

      foreach (@line){ $_ =~ s/\</\&lt\;/g; $_ =~ s/\>/\&gt\;/g; }

      work just as well?

      John J Reiser
      newrisedesigns.com
        Well, if that's what I meant, I probably would have typed something like this:

        for (@line) { s/</&lt;/g; s/>/&gt;/g; }

        But to answer your question: It's actually less brain drain to use a tested and complete module than to write equivalent code -- especially when the code isn't equivalent at all.... Or were you under the impression that the only dangerous characters for HTML rendering are "<" and ">"?

            -- Chip Salzenberg, Free-Floating Agent of Chaos
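To make the point about "<" and ">" concrete, here is a minimal sketch in plain Perl of what goes wrong when only angle brackets are escaped. The helper names `escape_html` and `escape_angle_only`, and the sample input, are made up for illustration and do not come from the thread. The five-character table is only a stand-in for a tested module such as HTML::Entities, not a replacement for it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): escape the five characters
# that matter in HTML text and in quoted attribute values.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

sub escape_html {
    my ($text) = @_;
    # Single pass, so the "&" in an entity we just produced is never re-escaped.
    $text =~ s/([&<>"'])/$ENTITY{$1}/g;
    return $text;
}

# Escapes only < and >, like the hand-rolled loop above.
sub escape_angle_only {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Assumed sample input: no angle brackets at all, yet it can
# break out of a double-quoted attribute value.
my $name = q{x" onmouseover="alert(1)};

printf qq{<input value="%s">\n}, escape_angle_only($name);
printf qq{<input value="%s">\n}, escape_html($name);
```

The first line printed still contains the raw double quote, so the browser sees a new `onmouseover` attribute. The second keeps the whole input inside `value`.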

Re: Removing HTML tags from a string
by Hero Zzyzzx (Curate) on Nov 17, 2001 at 09:03 UTC

    My current fav is HTML::TagFilter; it lets you easily strip all HTML, or allow/disallow certain tags based on innumerable attributes. It's a subclass of HTML::Parser. This is particularly useful if you want to allow certain tags, such as those for text formatting.

    use HTML::TagFilter;
    # This will strip all HTML from whatever you pass it.
    my $filter = HTML::TagFilter->new(allow => {}, strip_comments => 1);
    # $fieldvalue appears magically from somewhere
    $fieldvalue = $filter->filter($fieldvalue);

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

Re: Removing HTML tags from a string
by data64 (Chaplain) on Nov 17, 2001 at 06:47 UTC
Re: Removing HTML tags from a string
by Anonymous Monk on Nov 17, 2001 at 08:21 UTC
    I found that HTML::TokeParser works great for this.

    use HTML::TokeParser;
    $parser = HTML::TokeParser->new(\$html_messege);
    $wohtml = $parser->get_text("BODY");

Re: Removing HTML tags from a string
by Elgon (Curate) on Nov 18, 2001 at 02:32 UTC

    Baz,

    This kind of thing is very important when using Perl for CGI and certain other applications (use Super Search to read up on the -T switch, or taint checking).

    Basically, the way to do it is with regexps or one of the modules listed above. The key is the philosophy with which you approach the problem: the way I think it should be done is that ANYTHING which is not expressly allowed should be forbidden. If, for example, you want to use some entered text for a message book, then you should strip ALL characters except A-Z, a-z, 0-9, and , ! . and maybe ? This heads off just about any kind of problem, because no tags such as the potentially nasty <script>javascript code</script> can get through. If in doubt, write some code, post it here and ask for comment; I'm sure that the gods will not be upset.

    Hope this helps.

    "A nerd is someone who knows the difference between a compiled and an interpreted language, whereas a geek is a person who can explain it cogently to a non-geek over a couple of beers" - Elgon
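The "anything not expressly allowed is forbidden" approach from the reply above can be sketched with a single `tr///`. The helper name `keep_allowed` is hypothetical, and including the space character in the allowed set is an assumption; the reply lists only letters, digits and a few punctuation marks.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep only letters, digits,
# space (an assumption), comma, exclamation mark, full stop and question mark.
sub keep_allowed {
    my ($text) = @_;
    # /c complements the list, /d deletes: everything NOT listed is removed.
    $text =~ tr/A-Za-z0-9 ,!.?//cd;
    return $text;
}

print keep_allowed('<script>alert("hi")</script> Hello, world!'), "\n";
```

Only the characters that make markup work are removed; the tag names themselves survive as harmless plain text. A filter like this also drops apostrophes and accented letters, so it may be too strict for real names.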

Re: Removing HTML tags from a string
by kilinrax (Deacon) on Jun 24, 2003 at 17:22 UTC
    I'd suggest you use HTML::Strip (though I'm obviously biased, being the author):
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;
