Handling HTML special characters correctly

cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Re: Handling HTML special characters correctly by pc88mxer (Vicar) on Jul 02, 2008 at 18:45 UTC
The way that text is encoded in a form POST depends on the encoding of the HTML page containing the form, so it is always advisable to declare your page encoding explicitly. This should be done in the response header with `Content-Type: text/html; charset=UTF-8`. Another way is to use the `META` tag in your HTML output: `<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">`. Also, when you receive form parameters in a CGI script, you always need to decode them according to how they were encoded by the form: `use CGI qw(:standard); use Encode; ... my $name = Encode::decode('utf-8', scalar(param('name')));` Now `$name` contains code points, which is probably the most useful representation for your application. From there you can convert it to any other particular encoding when you need to. The article "Character Conversions from Browser to Database" does a good job of explaining the issues involved.
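To make the decode-on-input, encode-on-output idea concrete, here is a minimal sketch that runs without a web server. The literal in `$raw_bytes` is a hypothetical stand-in for what `param('name')` might return from a UTF-8 page; it is not taken from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw form value: on a UTF-8 page the pound sign
# arrives as the two bytes C2 A3, followed here by "100".
my $raw_bytes = "\xC2\xA3100";

# Decode the bytes into Perl characters (code points).
my $name = decode('UTF-8', $raw_bytes);

printf "bytes received: %d, characters after decoding: %d\n",
    length($raw_bytes), length($name);

# Encode again only at the output boundary, using the same
# charset that the Content-Type header declares.
my $header = "Content-Type: text/html; charset=UTF-8\r\n\r\n";
my $body   = encode('UTF-8', "<p>Amount: $name</p>\n");
print $header, $body;
```

Working on the decoded string in between means that `length`, `substr` and regexes see characters rather than bytes.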
Re^2: Handling HTML special characters correctly by cosmicperl (Chaplain) on Jul 03, 2008 at 00:33 UTC
Thanks for the link, I'm sure it'll be very useful for my future projects :D
Re: Handling HTML special characters correctly by cosmicperl (Chaplain) on Jul 02, 2008 at 18:33 UTC
Doh, just found HTML::Entities.
Re^2: Handling HTML special characters correctly by LesleyB (Friar) on Jul 02, 2008 at 19:20 UTC
As I did yesterday, using it to convert C code to safe HTML text. As a general principle, always HTML-escape any data received from a form before displaying it again. If any data is to go into a database, or be used to access data in a database, then it really must be SQL-escaped to limit or prevent SQL injection attacks. These two procedures are not language-specific. Always use the taint flag in Perl CGI scripts, i.e. `#!/usr/bin/perl -T`, or `#!/usr/bin/perl -wT` to also have warnings on. The way to untaint form data is to use regexps, which verify that the data is in the expected range.
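A rough sketch of the regexp-based untainting mentioned above. The helper name `untaint_username` and the allowed pattern (ASCII letters, digits and underscore, 1 to 32 characters) are assumptions for illustration only. The snippet also runs without `-T`; under taint mode, the captured `$1` is what Perl treats as untainted.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the validated value, or undef when
# the input does not match the expected shape.
sub untaint_username {
    my ($input) = @_;
    return undef unless defined $input;
    # \A and \z anchor the whole string, so a trailing newline or
    # shell metacharacters cause the match to fail.
    return $input =~ /\A([A-Za-z0-9_]{1,32})\z/ ? $1 : undef;
}

for my $candidate ('cosmicperl', 'bad; rm -rf /') {
    my $clean = untaint_username($candidate);
    print defined $clean ? "accepted: $clean\n" : "rejected\n";
}
```

The pattern should describe what is allowed, not what is forbidden; a catch-all such as `/(.*)/` would clear the taint flag without verifying anything.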
Re^2: Handling HTML special characters correctly by pc88mxer (Vicar) on Jul 02, 2008 at 19:50 UTC
Just want to point out that you don't need to convert the code point \xA3 to `&pound;` when outputting it. If you are only using Latin-1 characters, you shouldn't have to use `encode_entities` on anything but the special HTML characters: <, >, &, and ". The code point \xA3 is directly representable in Latin-1 and UTF-8 (and any other reasonable encoding you would use for your web page). You only have to use `encode_entities` on those code points that are not directly representable in the character set (encoding) used for your page.
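As an illustration of that point, `encode_entities` from HTML::Entities accepts an optional second argument listing the characters to escape. The sample string is made up, and the sketch assumes the page is served as UTF-8 and that HTML::Entities is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Made-up sample text containing a pound sign and HTML specials.
my $text = "\x{A3}5 < \x{A3}10 & \"cheap\"";

# Restrict escaping to the four special HTML characters; the
# pound sign passes through untouched.
my $minimal = encode_entities($text, '<>&"');

# With no second argument, high-bit characters such as the pound
# sign are converted to entities as well.
my $full = encode_entities($text);

binmode STDOUT, ':encoding(UTF-8)';
print "$minimal\n$full\n";
```

Both results display the same in a browser; the first simply relies on the declared page encoding to carry the pound sign.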
Re^3: Handling HTML special characters correctly by monarch (Priest) on Jul 02, 2008 at 22:11 UTC
...although "`\x{A3}`" in UTF-8 would be encoded as two bytes ("`\x{C2}\x{A3}`"). See the UTF-8 encoding table. Update: removed superfluous parenthesis.
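The byte-level difference is easy to check with the core Encode module. In this small sketch the helper `hex_bytes` is a hypothetical name introduced only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $pound = "\x{A3}";    # one character, code point U+00A3

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $latin1 = hex_bytes(encode('ISO-8859-1', $pound));
my $utf8   = hex_bytes(encode('UTF-8', $pound));

print "ISO-8859-1: $latin1\n";
print "UTF-8:      $utf8\n";
```

So the same code point is one byte (A3) in Latin-1 and two bytes (C2 A3) in UTF-8, which is why the declared charset and the bytes actually sent have to agree.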
Re^4: Handling HTML special characters correctly by cosmicperl (Chaplain) on Jul 03, 2008 at 00:25 UTC
Re^5: Handling HTML special characters correctly by tinita (Parson) on Jul 03, 2008 at 09:01 UTC
Re^5: Handling HTML special characters correctly by ikegami (Patriarch) on Jul 03, 2008 at 03:36 UTC