in reply to Handling HTML special characters correctly

Doh - Just found:- HTML::Entities

Re^2: Handling HTML special characters correctly
by LesleyB (Friar) on Jul 02, 2008 at 19:20 UTC

    As I did yesterday, using it to convert C code to safe HTML text.

    As a general principle, always HTML-escape any data received from a form before displaying it again.
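    A minimal sketch of that principle, assuming HTML::Entities is installed (it is not a core module) and using a made-up form value in place of real CGI input:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical form value; in a CGI script this would come from the request.
my $comment = q{<script>alert("hi")</script> & more};

# Escape only the four HTML-special characters: < > & "
my $safe = encode_entities($comment, '<>&"');

print "<p>$safe</p>\n";
```

    Passing the second argument restricts the escaping to the listed characters; without it, encode_entities also escapes control and high-bit characters.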

    If any data is to go into a database, or be used to look up data in one, then it really must be SQL-escaped to limit or prevent SQL injection attacks.

    These two procedures are not language specific.

    Always use the taint flag in Perl CGI scripts, i.e.

    #!/usr/bin/perl -T

    or

    #!/usr/bin/perl -wT

    to also have warnings on.

    The way to untaint form data is to match it against a regexp and use the captured value. This verifies that the data is in the expected range.
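    A rough sketch of regexp-based untainting. The helper name untaint_username and the 1-32 word-character rule are made up for illustration; the pattern should be whatever actually describes valid data for the field in question:

```perl
use strict;
use warnings;

# Hypothetical helper: accept only 1-32 word characters and return the
# captured text, or undef if the input does not match.
sub untaint_username {
    my ($raw) = @_;
    return unless defined $raw;
    # Under -T, a value taken from a capture group is no longer tainted.
    if ($raw =~ /\A(\w{1,32})\z/) {
        return $1;
    }
    return;
}

for my $input ('lesley_b', 'rm -rf /; echo') {
    my $clean = untaint_username($input);
    print defined $clean ? "accepted: $clean\n" : "rejected: $input\n";
}
```

    Anchoring with \A and \z matters here: an unanchored pattern would happily capture a valid-looking fragment out of hostile input.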

Re^2: Handling HTML special characters correctly
by pc88mxer (Vicar) on Jul 02, 2008 at 19:50 UTC
    Just want to point out that you don't need to convert the code-point \xA3 to &pound; when outputting it. If you are only using latin-1 characters, you shouldn't have to use encode_entities on anything but the special HTML characters: <, >, &, and ".

    The code-point \xA3 is directly representable in latin-1 and utf-8 (and any other reasonable encoding you would use for your web page.) You only have to use encode_entities on those code-points which are not directly representable by the character set (encoding) used for your page.

      ...although "\x{A3}" in UTF-8 would be encoded as two bytes ("\x{C2}\x{A3}"). See UTF-8 encoding table.

      Update: removed superfluous parenthesis.
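      To see the byte-level difference for yourself, here is a small sketch using the core Encode module. The variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $pound = "\x{A3}";   # U+00A3 POUND SIGN

# Encode the same code-point into two different page encodings.
my $latin1 = encode('ISO-8859-1', $pound);
my $utf8   = encode('UTF-8',      $pound);

# Show the resulting bytes as upper-case hex.
printf "latin-1: %s\n", uc unpack('H*', $latin1);
printf "utf-8:   %s\n", uc unpack('H*', $utf8);
```

      Either byte sequence displays correctly as long as the charset declared for the page matches the encoding actually written.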

        £ never used to cause me a problem on the old RH9. But these days most web servers seem to be set to en_US.UTF-8, where outputting a raw £ gives you a nasty ? in the browser, so it needs to be &pound; these days.

        On a side note, I just noticed something annoying about HTML::Entities: if your input is already encoded, such as &pound;, you'll get &amp;pound;. I thought it would have checked for encoded characters and skipped them?
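        One possible workaround, sketched here on the assumption that the input is known to be HTML-encoded text already, is to decode first and then encode, so each character is escaped exactly once. The sample string is made up:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Hypothetical input that is already partly entity-encoded.
my $input = 'Price: &pound;5 & up';

# Encoding directly escapes the existing ampersand a second time.
my $double = encode_entities($input);

# Decoding first normalises the text, so each character is encoded once.
my $once = encode_entities(decode_entities($input));

print "$double\n$once\n";
```

        This is not safe for arbitrary input: if a user really typed the literal text &pound; and wanted it shown as such, decoding first would change its meaning.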


        Lyle