Strange behaviour with utf8 and wide chars

isync has asked for the wisdom of the Perl Monks concerning the following question:

I've got a few lines of perl (part of a larger script) which print two vars to a browser.

The script:

 $string1 = "This is greek character 'y': μ";
 $string2 = "Hello \x{263A}!\n";
 print "Content-Type: text/html; charset=utf-8\n\n";
 print "$string1 - $string2";

The problem is: whenever I print the 'y' alone:

 $string1 = "This is greek character 'y': μ";
 print "Content-Type: text/html; charset=utf-8\n\n";
 print "$string1 - $string2";

it turns out OK in the browser.
But whenever just one wide utf-8 char is on that page, it breaks the 'y'!

 $string1 = "This is greek character 'y': μ";
 $string2 = "Hello \x{263A}!\n";
 print "Content-Type: text/html; charset=utf-8\n\n";
 print "$string1 - $string2";

This prints the smiley, but the 'y' becomes 'Î¼'! Any ideas? (Mozilla still says the page is interpreted as UTF-8.)

Thanks!

Running Perl 5.6.1 ($] reports 5.006001).

20070323 Janitored by Corion: Changed PRE to code tags, as per Writeup Formatting Tips


Re: Strange behaviour with utf8 and wide chars by Errto (Vicar) on Mar 22, 2007 at 17:54 UTC
Assuming this is running in CGI you can do `binmode STDOUT, ':utf8';` and that ought to work. This is required because even though you are telling the browser the data is in UTF-8, you aren't telling Perl to output UTF-8 unless you add that line. See perlunicode and perlio for more. Update: It has been so long since I used 5.6 I had forgotten about the good point raised by almut. But for heaven's sake, Perl 5.8.1 was released in 2003.
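To make the point above concrete, here is a minimal sketch of what appears to go wrong when undecoded UTF-8 bytes are mixed with a wide character. It assumes Perl 5.8 or later with the core Encode module, so it does not apply to 5.6.1 as-is. `encode_utf8` stands in for what a `:utf8` output layer would do, and `hex_bytes` is a hypothetical helper used only to make the bytes visible.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
}

my $mu_bytes = "\xCE\xBC";   # the two UTF-8 bytes of GREEK SMALL LETTER MU (U+03BC)
my $smiley   = "\x{263A}";   # a wide character (WHITE SMILING FACE)

# Undecoded bytes joined with a wide character: Perl treats each byte as a
# Latin-1 character, so the mu is effectively encoded a second time.
my $broken = encode_utf8($mu_bytes . $smiley);

# Decoding first turns the two bytes into the single character U+03BC.
my $fixed = encode_utf8(decode_utf8($mu_bytes) . $smiley);

print hex_bytes($broken), "\n";   # C3 8E C2 BC E2 98 BA
print hex_bytes($fixed),  "\n";   # CE BC E2 98 BA
```

The first line of output is the double-encoded form, which a browser reading UTF-8 shows as an accented "I" followed by "1/4" and then the smiley.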
Re^2: Strange behaviour with utf8 and wide chars by isync (Hermit) on Mar 22, 2007 at 23:36 UTC
Actually I was shocked myself when I saw the output of $] (Perl's version)! But I am on managed hosting and I am not able to update Perl. Also, the Encode module is not available to me, and as it is based on a binary, I can't just upload it to my working dir... So basically I am stuck here. The old Perl always worked for me, until recently, when I started to wrangle with XML and UTF-8. Soon I found out I would need a Western-to-UTF-8 transcoder, and that's when I found out there is really no lightweight alternative to Encode...
Back to my question: The greek 'y' is in UTF-8 (at least it should be; btw, I am getting it from a UTF-8 SQL db via XML, no problems so far), and printed that way it turns out right, at least on its own. When I add the "smiley" (which is also UTF-8, as I understand it, generated via the \x{...} escape), it seems to change $string1 into a different UTF-8 beast, giving me a strange accented "I" and "1/4" while the "smiley" is smiling just beside the mess.
Like you pointed out: the problem might/must be that although I am telling the browser the output is UTF-8, Perl doesn't output it as such (which I don't understand, as everything in the HTML, including both strings, should be UTF-8). Is there ANY way to do this in 5.6.1, to really output UTF-8? The "pack" trick mentioned everywhere doesn't seem to work for me... And using "use encoding utf8" commands etc. gives an error...
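One possible workaround that needs neither Encode nor any Unicode support from Perl is to keep everything as plain bytes, so that no wide character ever enters the output. The sketch below is hedged accordingly: `utf8_bytes` and `latin1_to_utf8` are hypothetical helper names, the code has not been tested on 5.6.1, and it assumes that `chr()` with values below 256 returns a single byte there, as it does on later Perls.

```perl
use strict;
use warnings;

# Hypothetical helper: build the UTF-8 byte sequence for one code point by
# hand, so the result is a plain byte string with no wide characters.
sub utf8_bytes {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

# Hypothetical helper: transcode an ISO-8859-1 byte string to UTF-8 bytes.
sub latin1_to_utf8 {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/utf8_bytes(ord($1))/ge;
    return $s;
}

my $string1 = "This is greek character 'y': \xCE\xBC";   # mu as raw UTF-8 bytes
my $string2 = "Hello " . utf8_bytes(0x263A) . "!\n";     # smiley as raw UTF-8 bytes

print "Content-Type: text/html; charset=utf-8\n\n";
print "$string1 - $string2";
```

Because both strings are byte strings, concatenating them should not trigger any re-encoding of the mu. `latin1_to_utf8` is only needed for data that arrives in a Western (ISO-8859-1) encoding.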
Re: Strange behaviour with utf8 and wide chars by almut (Canon) on Mar 22, 2007 at 18:35 UTC
First, it's probably a good idea to upgrade your Perl, if at all possible, in particular if you need Unicode support. For one, 5.6.1 had quite a different approach to Unicode internally than current 5.8 versions. Secondly, it's rather ancient anyway, so you're less likely to receive community help with problems that are specific to 5.6. Just as an example, the `binmode STDOUT, ':utf8';` mentioned by Errto does not work in 5.6 (you'd get an "Unknown discipline ':utf8' ..." error). Apart from that, it would be interesting to know in what encoding you have specified your greek 'y' in `$string1`. Depending on that, you probably want to add a `use encoding ...` directive (also 5.8, btw) to tell Perl how to parse literal strings, regexes, etc. in the script itself (e.g. `use encoding 'utf8';` if the script is encoded in UTF-8).
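One way to answer the question of which encoding the 'y' is actually stored in is to dump the string as hex codes. This is a small sketch that should run on old and new Perls alike. `hex_dump` is a hypothetical helper, and the two sample values are assumed stand-ins for what the script might contain.

```perl
use strict;
use warnings;

# Hypothetical helper: show each character of a string as a hex code.
sub hex_dump {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
}

# Assumed sample values, for illustration only:
my $as_utf8  = "\xCE\xBC";   # mu saved as UTF-8: two bytes
my $as_greek = "\xEC";       # mu saved as ISO-8859-7: one byte

print hex_dump($as_utf8),  "\n";   # CE BC
print hex_dump($as_greek), "\n";   # EC
```

If the dump of the real `$string1` ends in `CE BC`, the script holds UTF-8 bytes; a single `EC` would point to a Greek single-byte encoding instead.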
Re: Strange behaviour with utf8 and wide chars by Burak (Chaplain) on Mar 22, 2007 at 18:40 UTC
Either save your file with utf8 encoding & BOM, or save it with utf8 encoding and a line with "use utf8;" (or "use encoding 'utf8';"). You can also do a manual fix:

 # file name: test.pl
 # save it with utf8 encoding
 use Encode qw(decode_utf8);
 binmode STDOUT, ':utf8';
 $string1 = "This is greek character 'y': μ";
 print "Content-Type: text/html; charset=utf-8\n\n";
 print decode_utf8 $string1;

You have to set an internal flag on utf8 data. See Encode for more info...
Edit: I didn't see you've mentioned your Perl version. You must upgrade to 5.8.x, as almut said, if you want serious Unicode support...