in reply to Code does Remove Html

This code is meant to remove all HTML tags.
Test code
my $value = qq(<html>
<head>
<title>HTML Document?</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Some Text.
<img src="/super/long/path/string/that/the/author/puts/on/a/separate/line.jpg"
alt="missing image"
>
</body>
</html>);
$value =~ s/<(([^ >]|\n|\s\w)*)>/ /gso;
print $value;

What I prefer to do is escape the special characters, so that the form text is harmless and still shows up as typed when it is printed in the browser.
Example code
$value =~ s{&}{&amp;}gso;
$value =~ s{"}{&quot;}gso;
$value =~ s{ }{ \&nbsp;}gso;
$value =~ s{<}{&lt;}gso;
$value =~ s{>}{&gt;}gso;
$value =~ s{\)}{&#41;}gso;
$value =~ s{\(}{&#40;}gso;
$value =~ s{\t}{ \&nbsp; \&nbsp; \&nbsp;}gso;
$value =~ s{\n}{<br>}gso;
print $value;
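
The same idea can be sketched as a single pass over the string, which sidesteps any worry about the order of the substitutions (with separate statements, `&` has to be handled first, as it is above). This is only a minimal sketch: `escape_html` and `%ENTITY` are hypothetical names, and it covers just the characters that are special in HTML, leaving out the space, tab, parenthesis and newline conversions.

```perl
use strict;
use warnings;

# Hypothetical lookup table: characters that are special in HTML.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper, not part of the code above.
sub escape_html {
    my ($text) = @_;
    # One pass: the '&' of an entity we just wrote is never escaped again.
    $text =~ s/([&<>"'])/$ENTITY{$1}/g;
    return $text;
}

print escape_html(qq(<a href="x">Tom & Jerry</a>)), "\n";
# prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```

If installing a module is an option, `encode_entities` from HTML::Entities on CPAN does the same job.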

Re^2: Code does Remove Html (hole)
by tye (Sage) on Dec 17, 2006 at 16:07 UTC

    If you "remove HTML" with a regex like that, then I can still get whatever HTML I want in like so:

    <a <b>href="www.example.com">Cheap Viagra!</</b>a> <script<b>> alert("CHEAP VIAGRA!") </script</b>>

    Escaping is a much better idea.

    - tye        
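
To see this kind of hole in action, here is a small sketch that runs the tag-stripping pattern from the parent post over a made-up input. The input string is an assumption chosen for illustration, not the exact text from the post above. The pattern will not match across a space that is followed by a non-word character, so it skips the outer tag, removes only the inner `<b>`, and leaves a well-formed tag behind.

```perl
use strict;
use warnings;

# Hypothetical input: a throwaway <b> tag is tucked inside each real tag.
my $input = '<script <b>>alert(1)</script <b>>';

# The tag-stripping pattern from the parent post, applied to a copy.
(my $stripped = $input) =~ s/<(([^ >]|\n|\s\w)*)>/ /gs;

print "after stripping: $stripped\n";
print "script tag survived\n" if $stripped =~ /<script\s*>/;
```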

      Escaping is a much better idea.

      Yes it is!
      If done right, the text entered is safe and you don't lose any of the data.
      Done the other way, a lot of data can be lost and the code can still be bypassed.


      Update: I just had to write a pattern that removes the text in tye's post.
      This is what I came up with:
      $value =~ s/<(([^ >]|\n|\s\/|\s\S\S)*)>/ /gso;

      That code works somewhat better, but it still removes parts that it should not, and as tye stated, escaping is the better way to go.
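
A rough illustration of the data-loss point, assuming a made-up input that contains angle brackets but no HTML at all. The updated pattern throws away the e-mail address and the header name, while escaping keeps every character and just makes it inert. `%ENTITY` and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical form input: angle brackets, but nothing that is really HTML.
my $text = 'Mail Bob <bob@example.com> or see #include <stdio.h>';

# The updated stripping pattern from this post, applied to a copy.
(my $stripped = $text) =~ s/<(([^ >]|\n|\s\/|\s\S\S)*)>/ /gs;

# Escaping instead: nothing is removed, the markup characters become entities.
my %ENTITY = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
(my $escaped = $text) =~ s/([&<>"])/$ENTITY{$1}/g;

print "stripped: $stripped\n";
print "escaped:  $escaped\n";
```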