in reply to Regexp to ignore HTML tags
Quite likely the others are right that you should use an existing HTML parser. If for some reason that is overkill for you, or if the string is not really valid HTML, you may try something like this:
(This function escapes all characters special to HTML that are not part of valid HTML tags or entities.)

    use HTML::Entities ();

    # $AllowXHTML is a global flag set elsewhere; when true, tags ending in "/>" are accepted too.
    sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
            # $1 = loose text (gets encoded), $2 = entity, tag, or end of line (kept as-is)
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
            # Same pattern, but without the optional "/" before the closing ">"
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
    }
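In case the one-line regexp is hard to read, here is a rough sketch of the same idea with the pattern split into named pieces. It is a simplification, not the function above: `escape_text` and `polish_html_sketch` are hypothetical names, and the hand-rolled escaper only handles `&`, `<` and `>` (so the sketch runs without HTML::Entities), whereas the original also encodes control and high-bit characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Entities::encode: escapes only &, < and >.
sub escape_text {
    my ($text) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $text =~ s/([&<>])/$ent{$1}/g;
    return $text;
}

# Pieces of the pattern, named for readability.
my $entity    = qr{&\w+;|&\#\d+;};                       # &amp; or &#160;
my $attr_val  = qr{[^"'><\s]+|'[^']*'|"[^"]*"};          # bare, single- or double-quoted
my $attribute = qr{\s+\w+(?:\s*=\s*(?:$attr_val))?};     # name or name=value
my $open_tag  = qr{<\w+(?:$attribute)*\s*/?>};           # <b>, <a href="x">, <br />
my $close_tag = qr{</\w+>};                              # </b>

sub polish_html_sketch {
    my ($str) = @_;
    # $1 is loose text to escape; $2 is a tag, an entity, or end of line, kept verbatim.
    $str =~ s{(.*?)($entity|$open_tag|$close_tag|$)}{escape_text($1) . $2}gem;
    return $str;
}

print polish_html_sketch('a < b & <b>bold</b> &amp; 1 > 0'), "\n";
# prints: a &lt; b &amp; <b>bold</b> &amp; 1 &gt; 0
```

The lazy `(.*?)` grabs as little text as possible before the next thing that looks like a tag or entity, so a stray `<` or `&` that does not start one ends up in `$1` and gets escaped.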
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Edit by castaway: Closed small tag in signature
Regexp to ignore HTML tags
by markhoy (Novice) on Apr 01, 2003 at 14:21 UTC