Re: Code does Remove Html

This code will remove all HTML tags.
Test code

my $value = qq(<html>
<head>
<title>HTML Document?</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<body bgcolor="#FFFFFF" text="#000000">
 Some Text.
 <img
 src="/super/long/path/string/that/the/author/puts/on/a/separate/line.jpg"
 alt="missing image"
>
</body>
</html>);
$value =~ s/<(([^ >]|\n|\s\w)*)>/ /gso;

print $value;

What I prefer to do is escape the special characters, which makes the form text harmless and keeps the characters visible when printed in the browser.
Example code

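        # Escape & first so the entities added below are not escaped twice.
        $value =~ s{&}{&amp;}gso;
        $value =~ s{"}{&quot;}gso;
        $value =~ s{  }{ \&nbsp;}gso;
        $value =~ s{<}{&lt;}gso;
        $value =~ s{>}{&gt;}gso;
        $value =~ s{\)}{&#41;}gso;
        $value =~ s{\(}{&#40;}gso;
        $value =~ s{\t}{ \&nbsp; \&nbsp; \&nbsp;}gso;
        # Convert newlines last so the inserted <br> tags are not escaped.
        $value =~ s{\n}{<br>}gso;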
        print $value;
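As a rough sketch of how the escaping above might be packaged for reuse, the same substitutions can be wrapped in a subroutine. The name `escape_html` and the sample input are assumptions made for illustration; they are not part of the original post.

```perl
use strict;
use warnings;

# escape_html is a hypothetical helper name, not from the original post.
sub escape_html {
    my ($value) = @_;
    $value =~ s{&}{&amp;}g;      # first, so later entities are not escaped twice
    $value =~ s{"}{&quot;}g;
    $value =~ s{  }{ &nbsp;}g;   # keep runs of spaces visible
    $value =~ s{<}{&lt;}g;
    $value =~ s{>}{&gt;}g;
    $value =~ s{\)}{&#41;}g;
    $value =~ s{\(}{&#40;}g;
    $value =~ s{\t}{ &nbsp; &nbsp; &nbsp;}g;
    $value =~ s{\n}{<br>}g;      # last, so the inserted <br> is not escaped
    return $value;
}

# Made-up sample input that would be dangerous if printed unescaped.
my $input  = qq{<script>alert("x & y")</script>\n};
my $output = escape_html($input);
print "$output\n";
# prints: &lt;script&gt;alert&#40;&quot;x &amp; y&quot;&#41;&lt;/script&gt;<br>
```

Nothing is thrown away here: the browser shows the submitted text exactly as typed, and none of it is interpreted as markup.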


Re^2: Code does Remove Html (hole) by tye (Sage) on Dec 17, 2006 at 16:07 UTC
If you "remove HTML" with a regex like that, then I can still get whatever HTML I want in, like so:

`<a <b>href="www.example.com">Cheap Viagra!</</b>a> <script<b>> alert("CHEAP VIAGRA!") </script</b>>`

Escaping is a much better idea.

- tye
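To make the hole concrete, here is a minimal sketch that runs the tag-stripping substitution from the first post over the link half of the example input. The subroutine name `strip_tags` is an assumption made for illustration; the regex itself is copied from the post.

```perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name; the substitution inside it is
# copied verbatim from the first post in the thread.
sub strip_tags {
    my ($value) = @_;
    $value =~ s/<(([^ >]|\n|\s\w)*)>/ /gso;
    return $value;
}

# The link half of the example input.
my $attack   = q{<a <b>href="www.example.com">Cheap Viagra!</</b>a>};
my $stripped = strip_tags($attack);
print "$stripped\n";

# The regex cannot match "<a " because the space is followed by "<", so it
# removes only the nested <b>, which leaves a usable opening <a> tag behind.
print "link survived\n" if $stripped =~ /<a\s+href="www\.example\.com">/;
```

The closing tag is mangled, but the opening `<a href=...>` comes through intact, which is all a browser needs to render a link.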
Re^3: Code does Remove Html (hole) by SFLEX (Chaplain) on Dec 17, 2006 at 17:40 UTC
"Escaping is a much better idea."

Yes, it is! Done right, the entered text is safe and you don't lose any of the data. Done the other way, a lot of data can be lost and the code can be bypassed.

Updated: I just had to write some code to remove the text in tye's post. This is what I came up with:

`$value =~ s/<(([^ >]|\n|\s\/|\s\S\S))>/ /gso;`

That code works somewhat better, but it still removes parts that it should not, and, as tye stated, escaping is the better way to go.