cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I've come across a bit of a problem. I'll give a quick-and-dirty example to show what I mean:
<form action="script.cgi" method=post> <input name=textin value="&pound;"> <input type=submit> </form>
Now if I submit this form, my script won't get &pound;, it'll get the ANSI character '£' (I think), which won't save or display properly in a Unicode environment.
Do I need to parse all input and swap the ANSI characters for the &something; ones? Or is there another way?
I'm hoping that there is already a solution for this on CPAN but can't find it.

Thanks in advance

Lyle

Replies are listed 'Best First'.
Re: Handling HTML special characters correctly
by pc88mxer (Vicar) on Jul 02, 2008 at 18:45 UTC
    The way that text is encoded in a form POST depends on the encoding of the HTML page containing the form, so it is always advisable to declare your page encoding explicitly. This can be done in the HTTP response header with:
    Content-Type: text/html; charset=UTF-8
    Another way is to use the META tag in your HTML output:
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

    Also, when you receive form parameters in a CGI script, you always need to decode them according to how they were encoded by the form:

    use CGI qw(:standard);
    use Encode;
    ...
    my $name = Encode::decode('utf-8', scalar(param('name')));
    Now $name will contain code-points which is probably the most useful representation for your application. From there you can convert it to any other particular encoding when you need to.
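To make the bytes-versus-code-points distinction concrete, here is a minimal sketch using only the core Encode module. The hard-coded `$raw` string is an assumption standing in for what `param()` would return from a form on a UTF-8 page; nothing here talks to a real request.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the raw bytes of a form field submitted from a UTF-8 page
# (assumption: in a real script these would come from CGI's param()).
my $raw = "caf\xC3\xA9 \xC2\xA3" . "5";

# Decode the UTF-8 bytes into Perl characters (code points).
my $text = decode('UTF-8', $raw);

# The byte string is longer than the decoded string, because the two
# accented/symbol characters each take two bytes in UTF-8.
printf "bytes in: %d, characters: %d\n", length($raw), length($text);

# Encode only at the output boundary; here an I/O layer does it for us.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

The point of decoding early is that `length`, `substr` and regexes then operate on characters rather than on individual bytes.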

    This article: Character Conversions from Browser to Database does a good job of explaining the issues involved.

      Thanks for the link, I'm sure it'll be very useful for my future projects :D
Re: Handling HTML special characters correctly
by cosmicperl (Chaplain) on Jul 02, 2008 at 18:33 UTC

      As I did yesterday, using it to convert C code to safe HTML text.

      As a general principle, always HTML-escape any data received from a form before displaying it again.

      If any data is to go on to a database, or be used to access data in a database, then it really must be SQL-escaped to limit or prevent SQL injection attacks.

      These two procedures are not language specific.

      Always use the taint flag in Perl CGI scripts, i.e.

      #!/usr/bin/perl -T

      or

      #!/usr/bin/perl -wT

      to also have warnings on.

      The way to untaint form data is to use regexps. This verifies the data is in the range expected.
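As a rough illustration of untainting with a regexp, the sketch below validates a value against an allow-list pattern and returns the captured text. The helper name `untaint_username` and the pattern are hypothetical and chosen only for the example; the block leaves `-T` off the shebang so it also runs when invoked as `perl file.pl`, but the capture technique is the same under taint mode.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured (and therefore untainted) value
# if it matches the expected pattern, or undef otherwise.
sub untaint_username {
    my ($value) = @_;
    return unless defined $value;

    # Allow only 1 to 32 ASCII letters, digits or underscores.
    # \A and \z anchor the match to the whole string.
    if ($value =~ /\A([A-Za-z0-9_]{1,32})\z/) {
        return $1;    # text taken from a capture group is untainted
    }
    return;
}

for my $input ('lyle', 'bad; rm -rf /') {
    my $clean = untaint_username($input);
    print defined $clean ? "accepted: $clean\n" : "rejected: $input\n";
}
```

Keeping the pattern as narrow as the field allows is the design choice that matters here: a catch-all such as `(.*)` would untaint the data without verifying anything.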

      Just want to point out that you don't need to convert the code-point \xA3 to &pound; when outputting it. If you are only using latin-1 characters, you shouldn't have to use encode_entities on anything but the special HTML characters: <, >, &, and ".

      The code-point \xA3 is directly representable in latin-1 and utf-8 (and any other reasonable encoding you would use for your web page.) You only have to use encode_entities on those code-points which are not directly representable by the character set (encoding) used for your page.
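A small sketch of the difference, assuming the CPAN module HTML::Entities (which provides `encode_entities`) is installed. Passing a second argument restricts which characters are treated as unsafe; the sample string is made up for the example.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Sample text containing a pound sign (U+00A3) and HTML special characters.
my $text = "Cost: \x{A3}" . "5 <b>tea & cake</b>";

# Escape only the four HTML special characters; the pound sign is left alone.
my $minimal = encode_entities($text, '<>&"');

# With no second argument, high-bit characters are turned into entities too.
my $full = encode_entities($text);

binmode STDOUT, ':encoding(UTF-8)';
print "minimal: $minimal\n";
print "full:    $full\n";
```

The minimal form is only safe if the page really is served in an encoding that can represent every character in the output, as the reply above points out.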

        ...although "\x{A3}" in UTF-8 would be encoded as two bytes ("\x{C2}\x{A3}"). See UTF-8 encoding table.

        Update: removed superfluous parenthesis.
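The byte counts mentioned here can be checked with the core Encode module. This is just a quick sketch that hex-dumps the pound sign in the two encodings discussed in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $pound = "\x{A3}";    # the code point U+00A3

for my $encoding ('ISO-8859-1', 'UTF-8') {
    # Turn the character into bytes in the given encoding.
    my $bytes = encode($encoding, $pound);

    # Render each byte as two uppercase hex digits.
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-10s %s\n", $encoding, $hex;
}
```

In Latin-1 the character is the single byte A3, while in UTF-8 it is the pair C2 A3.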