Re: Removing HTML tags from a string
by chip (Curate) on Nov 17, 2001 at 08:02 UTC
Rather than removing the tags, you could easily use HTML entity encoding to make them appear literally as the user typed them:
use HTML::Entities;
print encode_entities("<b>hi</b>");
&lt;b&gt;hi&lt;/b&gt;
-- Chip Salzenberg, Free-Floating Agent of Chaos
foreach (@line){
$_ =~ s/\</\&lt\;/g;
$_ =~ s/\>/\&gt\;/g;
}
Wouldn't that work just as well?
John J Reiser
newrisedesigns.com
Well, if that's what I meant, I probably would have typed
something like this:
for (@line) {
s/</&lt;/g;
s/>/&gt;/g;
}
But to answer your question: It's actually less brain drain
to use a tested and complete module than to write equivalent
code -- especially when the code isn't equivalent at all....
Or were you under the impression that the only dangerous
characters for HTML rendering are "<" and ">"?
-- Chip Salzenberg, Free-Floating Agent of Chaos
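To make that last question concrete, here is a minimal sketch in core Perl of what slips past a filter that rewrites only `<` and `>`. The helper name `escape_angle_only` and the sample input are made up for illustration; the point is simply that a double quote in user input can still break out of an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: escapes only the two angle brackets,
# like the hand-rolled loop discussed above.
sub escape_angle_only {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up hostile input that contains no angle brackets at all.
my $input = q{" onmouseover="alert(1)};

# The quote survives untouched, so it closes the attribute early
# and smuggles in a new one.
my $html = '<input value="' . escape_angle_only($input) . '">';
print "$html\n";    # prints: <input value="" onmouseover="alert(1)">
```

With its default character set, `encode_entities` from HTML::Entities also encodes `&`, `"` and `'`, which is one reason the tested module is the safer choice here.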
Re: Removing HTML tags from a string
by Hero Zzyzzx (Curate) on Nov 17, 2001 at 09:03 UTC
My current fav is HTML::TagFilter. It lets you easily strip all HTML, or allow/disallow certain tags based on innumerable attributes. It's a subclass of HTML::Parser. This is particularly useful if you want to allow only certain tags, such as those for text formatting.
use HTML::TagFilter;
# This will strip all HTML from whatever you pass it.
my $filter=HTML::TagFilter->new(allow=>{},strip_comments=>1);
#fieldvalue appears magically from somewhere;
$fieldvalue=$filter->filter($fieldvalue);
-Any sufficiently advanced technology is indistinguishable from doubletalk.
Re: Removing HTML tags from a string
by data64 (Chaplain) on Nov 17, 2001 at 06:47 UTC
Re: Removing HTML tags from a string
by Anonymous Monk on Nov 17, 2001 at 08:21 UTC
Re: Removing HTML tags from a string
by Elgon (Curate) on Nov 18, 2001 at 02:32 UTC
Baz,
This kind of thing is very important when using Perl for CGI and certain other applications (use Super Search to read up on the -T switch, i.e. taint checking).
Basically, the way to do it is with regexps or one of the modules listed above. The key is the philosophy with which you approach the problem: the way I think it should be done is that ANYTHING which is not expressly allowed should be forbidden. If, for example, you want to use some entered text for a message book, then you should strip ALL characters except A-Z, a-z, 0-9, the punctuation , ! . and maybe ? as well. This heads off just about any kind of problem, because no tags such as the potentially nasty <script>javascript code</script> can get through. If in doubt, write some code, post it here and ask for comment; I'm sure that the gods will not be upset.
Hope this helps.
"A nerd is someone who knows the difference between a compiled and an interpreted language, whereas a geek is a person who can explain it cogently to a non-geek over a couple of beers" - Elgon
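The "forbid anything not expressly allowed" approach described in this post might be sketched like this in core Perl. The helper name `keep_allowed` is hypothetical, and including the space character in the allowed set is an assumption; adjust the character class to whatever your application really permits.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character that is not expressly allowed.
# Allowed set (an assumption based on the post): letters, digits,
# space, comma, exclamation mark, full stop and question mark.
sub keep_allowed {
    my ($text) = @_;
    $text =~ s/[^A-Za-z0-9 ,!.?]//g;
    return $text;
}

my $clean = keep_allowed('Hi <script>alert(1)</script>!');
print "$clean\n";    # prints: Hi scriptalert1script!
```

The leftover letters from the tag are harmless text, since none of the characters needed to form markup (`<`, `>`, `&`, quotes) can survive. The trade-off is that legitimate input such as apostrophes or accented letters is dropped as well.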
Re: Removing HTML tags from a string
by kilinrax (Deacon) on Jun 24, 2003 at 17:22 UTC
I'd suggest you use HTML::Strip (though I'm obviously biased, being the author):
use HTML::Strip;
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;