Unicode & Locales

Mr. Muskrat has asked for the wisdom of the Perl Monks concerning the following question:

In Date::Language::*, I mentioned that I need to create some Date::Language modules since most of the work is already done.

Take a look at my old (extremely ugly & improperly encoded) script for the Greek example.

#!/usr/bin/perl
my ($tsec,$tmin,$thour,$tmday,$tmon,$tyear,$twd,$tyd,$tds) = localtime(time);
my (@tweekdays) = (
'&#922;&#965;&#961;&#953;&#945;&#954;&#942;',
'&#916;&#949;&#965;&#964;&#941;&#961;&#945;',
'&#932;&#961;&#943;&#964;&#951;',
'&#932;&#949;&#964;&#940;&#961;&#964;&#951;',
'&#928;&#941;&#956;&#960;&#964;&#951;',
'&#928;&#945;&#961;&#945;&#963;&#954;&#949;&#965;&#942;',
'&#931;&#940;&#946;&#946;&#945;&#964;&#959;'
);
my (@tmonths) = (
'&#921;&#945;&#957;&#959;&#965;&#945;&#961;&#943;&#959;&#965;',
'&#934;&#949;&#946;&#961;&#959;&#965;&#945;&#961;&#943;&#959;&#965;',
'&#924;&#945;&#961;&#964;&#943;&#959;&#965;',
'&#913;&#960;&#961;&#953;&#955;&#943;&#959;&#965;',
'&#924;&#945;&#970;&#959;&#965;',
'&#921;&#959;&#965;&#957;&#943;&#959;&#965;',
'&#921;&#959;&#965;&#955;&#943;&#959;&#965;',
'&#913;&#965;&#947;&#959;&#973;&#963;&#964;&#959;&#965;',
'&#931;&#949;&#960;&#964;&#949;&#956;&#946;&#961;&#943;&#959;&#965;',
'&#927;&#954;&#964;&#969;&#946;&#961;&#943;&#959;&#965;',
'&#925;&#959;&#949;&#956;&#946;&#961;&#943;&#959;&#965;',
'&#916;&#949;&#954;&#949;&#956;&#946;&#961;&#943;&#959;&#965;'
);
my ($tweekday) = $tweekdays[$twd];
my ($tmonth) = $tmonths[$tmon];
$tyear += 1900;
my ($tdate) = "$tweekday, $tmday $tmonth, $tyear";
print "Content-type: text/html\n\n";
print "$tdate\n";

It displays the date like this: Παρασκευή 25 Οκτωβρίου 2002.
In the web page that calls the script I have the following HTML:

<html lang="el">
<head>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1253">

Will using proper Unicode be enough? Or will I need to force a locale as well? I may not even be asking the right questions!

Re: Unicode & Locales by Thelonius (Priest) on Oct 25, 2002 at 20:41 UTC
The way you coded it, with numeric character references, will always work in HTML, no matter what the "charset" is. Those strings will not be interpreted as Unicode within Perl or pretty much anywhere else that you write them. (An XML processor will interpret them as a web browser would.) If you want to code the strings so that they can be used generally, you need to replace `'&#922;&#965;'` with `"\x{39A}\x{3C5}"` and to `use utf8;`. How these strings are used elsewhere is still complicated.
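A minimal sketch of the replacement described above, assuming the entities are decimal numeric character references as in the original script. The helper name `ncr_to_chars` is hypothetical and not part of any module; it only shows that the entity, the `\x{...}` escape, and `chr()` all name the same characters once Perl sees them as characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn decimal HTML numeric character references
# (e.g. "&#922;") into the Perl characters they stand for.
sub ncr_to_chars {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# The first two letters of the Greek word for Sunday, written three ways.
my $from_ncr    = ncr_to_chars('&#922;&#965;');
my $from_escape = "\x{39A}\x{3C5}";
my $from_chr    = chr(922) . chr(965);

# Encode on output so the characters are written as UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print "$from_ncr\n";
print "all three forms are equal\n"
    if $from_ncr eq $from_escape && $from_escape eq $from_chr;
```

The `binmode` call is one way to pick the output encoding; what is appropriate still depends on where the string ends up.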
Re: Re: Unicode & Locales by Mr. Muskrat (Canon) on Oct 25, 2002 at 20:53 UTC
I have the Unicode encodings for all of the characters, and I did say that my example was improperly encoded. I really just need to know if I will have to somehow set the locale. Let me see if I can whip up some fresh, properly encoded Perl code.
(tye)Re: Unicode & Locales by tye (Sage) on Oct 26, 2002 at 21:30 UTC
If you are writing HTML, then `&#943;` is probably the best thing to write since it should always work. But that won't work outside of HTML (or XML or similar formats). For other cases you need to indicate that you are using Unicode, UTF-8 in particular, and then use `chr(943)` instead. How you indicate that you are using UTF-8 depends on what you are writing to. For a module, you would just document that your code returns UTF-8 strings, and it would be the responsibility of the programmer using your module to get that to display properly when they output it. - tye
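A rough sketch of that division of labor, under the assumption that "returns UTF-8 strings" means the module hands back Perl character strings and the caller chooses the output form. The function name `iota_tonos` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical module-style function: returns a Perl character string
# and leaves the output encoding to the caller.
sub iota_tonos {
    return chr(943);    # U+03AF GREEK SMALL LETTER IOTA WITH TONOS
}

my $char = iota_tonos();

# Caller writing HTML: a numeric character reference works in any charset.
my $as_html = sprintf('&#%d;', ord($char));

# Caller writing plain text: encode explicitly to UTF-8 bytes.
my $as_bytes = encode('UTF-8', $char);

print "$as_html\n";
print length($char), " character, ", length($as_bytes), " bytes\n";
```

Either way the module itself stays free of any knowledge about the page charset or the terminal.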
Re: Unicode & Locales by Mr. Muskrat (Canon) on Oct 25, 2002 at 22:16 UTC
Okay, here's a quick little example that just generates the Greek word for Sunday.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;

our %g = (
    Kappa    => "\x{039a}",
    alpha    => "\x{03b1}",
    iota     => "\x{03b9}",
    rho      => "\x{03c1}",
    upsilon  => "\x{03c5}",
    etatonos => "\x{03ae}",
    kappa    => "\x{03ba}",
);

my $sunday = MakeWord(qw(Kappa upsilon rho iota alpha kappa etatonos));
print "Sunday in greek is $sunday.\n";

sub MakeWord {
    my $word;
    for (@_) {
        $word .= $g{$_};
    }
    return $word;
}
Re: Unicode & Locales by Mr. Muskrat (Canon) on Oct 25, 2002 at 20:21 UTC
Let me clarify a bit. The people using the script may or may not be creating CGI scripts. So when I start writing the Date::Language::* modules, will I need to worry about setting the locale or character set used? Or will I just need to code in the correct Unicode?
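One way to look at the question is a sketch like the following, which rebuilds the date string from hard-coded name tables (the same names as the entity-encoded script in the original post) and never touches the locale. It is not the Date::Language API; `greek_date` is a hypothetical helper, and the `use utf8` pragma assumes the source file itself is saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the Greek literals below are UTF-8 in this source file

# Name tables taken from the entity-encoded script in the question.
my @weekdays = qw(Κυριακή Δευτέρα Τρίτη Τετάρτη Πέμπτη Παρασκευή Σάββατο);
my @months   = qw(
    Ιανουαρίου Φεβρουαρίου Μαρτίου Απριλίου Μαϊου Ιουνίου
    Ιουλίου Αυγούστου Σεπτεμβρίου Οκτωβρίου Νοεμβρίου Δεκεμβρίου
);

# Hypothetical helper: takes the list returned by localtime() and
# returns a character string; it never consults the locale.
sub greek_date {
    my @t = @_;
    my ($mday, $mon, $year, $wday) = @t[3, 4, 5, 6];
    return sprintf('%s, %d %s, %d',
        $weekdays[$wday], $mday, $months[$mon], $year + 1900);
}

# The caller picks the output encoding.
binmode(STDOUT, ':encoding(UTF-8)');
print greek_date(localtime(time)), "\n";
```

Because the names come from the tables rather than from the system, this sketch suggests the locale only matters if the names are fetched through locale-aware calls; the character set still has to be settled wherever the string is finally written out.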