amonroy has asked for the wisdom of the Perl Monks concerning the following question:

I am using HTML::Entities to prevent cross-site scripting. Basically, any $variables that come from the user's input are passed to HTML::Entities::encode_entities() before being sent back to the browser. The user's input can be UTF-8 data. When I encode UTF-8 characters they don't show up properly in the browser; if I don't encode them, they are displayed just fine.

The solution I have is to HTML-encode only the non-UTF-8 characters. So for each character, I first have to check whether it is UTF-8 or not, using String::Multibyte.

Is there a better way to do this? I was hoping HTML::Entities would handle this. Is there a better module?

Thanks,
-Andrés

Re: HTML::Entities and UTF-8
by iburrell (Chaplain) on Apr 14, 2004 at 19:36 UTC
    What version of Perl are you using? With Perl 5.6, HTML::Entities seems to encode the individual UTF-8 bytes. With Perl 5.8, it encodes Unicode strings properly.
    $string = encode_entities("\x{263A}"); # $string = "&#x263A;";
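The difference between encoding bytes and encoding characters can be sketched as follows. This is only an illustration, assuming the user's input arrives as raw UTF-8 bytes (as form parameters commonly do) and that the core Encode module (Perl 5.8 and later) is available; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Assumption: this is what arrives from the browser, the three
# UTF-8 bytes that make up U+263A (WHITE SMILING FACE).
my $bytes = "\xE2\x98\xBA";

# Encoding the raw bytes treats each byte as a separate Latin-1
# character, so the result is three unrelated entities.
my $from_bytes = encode_entities($bytes);

# Decoding first yields a one-character Unicode string, which
# encode_entities turns into a single numeric entity.
my $chars      = decode('UTF-8', $bytes);
my $from_chars = encode_entities($chars);

print "from bytes: $from_bytes\n";
print "from chars: $from_chars\n";   # &#x263A;
```

If the first form is what reaches the browser, it would explain seeing gibberish even though the page is declared as UTF-8.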

      I'm using version 5.8.0.

      I am sending UTF-8 in the character set parameter of the HTTP header. When I print the $variables as they are, e.g. a Chinese character, the browser renders them just fine. But if I apply the HTML::Entities encoding, it shows gibberish.

      Now that I think about it, maybe I should not be sending UTF-8 in the HTTP header when I'm using HTML::Entities.

      Thanks a lot.

Re: HTML::Entities and UTF-8
by iburrell (Chaplain) on Apr 14, 2004 at 19:45 UTC
    Instead of searching for multibyte characters, you can change which characters encode_entities encodes so that it leaves the high-bit characters alone. This will preserve all the UTF-8 bytes. As long as your HTML is marked as encoded in UTF-8, the browser should display it properly.

    For example, this encodes just the special characters.

    $string = encode_entities($a, "<>&");
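A slightly fuller sketch of the same idea, assuming the input is a string of raw UTF-8 bytes. The sample input is invented, and the double quote is added to the set here on the assumption that the value might also end up inside an HTML attribute; it is not part of the one-liner above.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical user input: markup plus the UTF-8 bytes for "é".
my $input = "<b>caf\xC3\xA9</b> & more";

# The second argument lists the only characters to encode;
# everything else, including high-bit bytes, passes through.
my $safe = encode_entities($input, '<>&"');

print "$safe\n";
```

The markup characters come out as `&lt;`, `&gt;` and `&amp;`, while the two bytes of the accented character are left exactly as they were, so the page still has to be served with a UTF-8 charset.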