Re: Regexp to ignore HTML tags

Quite likely the others are right that you should use an existing HTML parser. If for some reason that is overkill for you, or if the string is not really valid HTML, you may try something like this:

use HTML::Entities ();

# Escapes HTML-special characters everywhere except inside well-formed
# tags and entities. $AllowXHTML is a global expected to be set elsewhere;
# when true, self-closing tags such as <br/> are also left intact.
sub PolishHTML {
    my $str = shift;
    if ($AllowXHTML) {
        $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
    } else {
        $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
    }
    return $str;
}

(This function escapes all characters special to HTML that are not part of valid HTML tags or entities.)
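In case the one-line regexp is hard to follow, here is a rough sketch of the same idea with the pattern split into pieces. It is not the function above, only an illustration: the name escape_outside_tags and the $entity, $attr_value, $attr, $open_tag and $close_tag variables are made up for this example, it always accepts the XHTML-style "/>", it does not allow several quoted strings in a row as an attribute value, and it escapes only <, > and & instead of the wider character set used above.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical building blocks (names invented for this sketch).
my $entity     = qr{ & \w+ ; | &\# \d+ ; }x;                  # &amp; or &#169;
my $attr_value = qr{ [^"'><\s]+ | '[^']*' | "[^"]*" }x;      # bare, '...' or "..."
my $attr       = qr{ \s+ \w+ (?: \s* = \s* $attr_value )? }x; # name or name=value
my $open_tag   = qr{ < \w+ $attr* \s* /? > }x;                # <b>, <a href="x">, <br/>
my $close_tag  = qr{ </ \w+ > }x;                             # </b>

# Escape <, > and & in the text between recognised tags and entities.
sub escape_outside_tags {
    my ($str) = @_;
    # $1 is the text before the next tag, entity or end of line;
    # $2 is that tag or entity (or empty at end of line), copied through as-is.
    $str =~ s{ (.*?) ( $entity | $open_tag | $close_tag | $) }
             { HTML::Entities::encode_entities($1, '<>&') . $2 }gemx;
    return $str;
}

print escape_outside_tags('a < b & <b>bold</b> &amp; done'), "\n";
```

This should print "a &lt; b &amp; <b>bold</b> &amp; done": the stray < and & are escaped, while the <b> tags and the existing &amp; entity pass through untouched.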

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
-- Rick Osborne

Edit by castaway: Closed small tag in signature

Regexp to ignore HTML tags by markhoy (Novice) on Apr 01, 2003 at 14:21 UTC
Thanks All!! Will give all suggestions a try ASAP.