The others are quite likely right that you should use an existing HTML parser. If for some reason that is overkill for you, or if the string is not really valid HTML, you may try something like this:
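(This function escapes all characters that are special in HTML, except where they are part of a valid-looking HTML tag or entity.)

    use HTML::Entities;

    # $AllowXHTML is a global flag set elsewhere; when true, self-closing
    # tags such as <br /> are also left untouched.
    sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
            # $1 = text before the next entity or tag (or end of line), $2 = that entity or tag
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
            # Same pattern, but without the optional "/" before the closing ">"
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
    }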
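To make the idea easier to follow, here is a stripped-down sketch of the same technique: lazily grab the text up to the next tag or entity, escape that text, and put the tag or entity back unchanged. It is not the function above. The names escape_text and polish_simple are made up for this illustration, the tag pattern is much looser than the original, and a small hand-written helper that only handles &, < and > stands in for HTML::Entities so the snippet runs with core Perl alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): escape only & < >.
my %ESC = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
sub escape_text {
    my ($text) = @_;
    $text =~ s/([&<>])/$ESC{$1}/g;
    return $text;
}

# Simplified token pattern (an assumption, far looser than the original):
# a named or numeric entity, an opening tag, or a closing tag.
my $token = qr{&\w+;|&\#\d+;|<\w+(?:\s+[^<>]*)?>|</\w+>};

sub polish_simple {
    my ($str) = @_;
    # $1 is the plain text before the next token (or end of line),
    # $2 is the token itself, which is kept as it is.
    $str =~ s{(.*?)($token|$)}{ escape_text($1) . $2 }gem;
    return $str;
}

print polish_simple('a < b & <b>bold</b> &amp; 1 > 0 <'), "\n";
# prints: a &lt; b &amp; <b>bold</b> &amp; 1 &gt; 0 &lt;
```

The stray <, > and & get escaped, while <b>, </b> and the existing &amp; pass through unchanged. The original function does the same thing with a stricter attribute pattern, and it also encodes control characters and anything outside printable ASCII.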
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Edit by castaway: Closed small tag in signature
In reply to Re: Regexp to ignore HTML tags by Jenda, in thread Regexp to ignore HTML tags by markhoy