Unicode on Win2k

Mike McClellan has asked for the wisdom of the Perl Monks concerning the following question:

On a Japanese Win2k box I run

print chr(0x141);

print "\x{141}";

print chr(321);

All three print the same character, but when I run the same code a second time on the same box I get a different character (a dot). Can anyone tell me why, and how to reset it? Adding use utf8; does not change the behavior. Eventually I want to have code like:

if($someInputString eq "Unicode String Here"){
...
}


Replies are listed 'Best First'.

(Ovid) Re: Unicode on Win2k
by Ovid (Cardinal) on Jul 26, 2000 at 22:36 UTC

I don't know the answer to your question, but I'll make some guesses and maybe they'll help. print chr(0x141); and print chr(321); print the same thing because 321 in decimal is the same number as 141 in hex. The statement print "\x{141}"; gives me an Illegal hex digit ignored... warning when I use the -w switch, and simply prints {141}.

Standard ASCII has no characters above 127 decimal (or 255 if you are using an "Extended ASCII" code page). When you attempt print chr(321); on a Perl without Unicode support, Perl simply drops the bits above the low byte and prints an "A", which is 65 in ASCII (256 + 65 = 321).
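The arithmetic above can be checked with a short sketch. It only models the wrap-around by masking to the low 8 bits; the variable names are made up for illustration, and a Perl with Unicode support would generally return the character U+0141 from chr(321) instead of wrapping.

```perl
use strict;
use warnings;

# 0x141 and 321 are the same number, so chr() sees the same argument.
my $code = 0x141;
print "0x141 == 321\n" if $code == 321;

# Keep only the low 8 bits to model a one-byte chr() (assumption:
# this mimics the wrap-around described above, it is not chr() itself).
my $low_byte = $code & 0xFF;      # 321 - 256 = 65
my $wrapped  = chr($low_byte);    # the ASCII character for 65

print "$low_byte -> $wrapped\n";  # prints "65 -> A"
```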

Since Japanese characters (and Unicode characters in many encodings) take more than one byte each, perhaps what is happening is some weird buffering problem where an extra byte gets moved to STDOUT, thereby throwing off the bytes that follow. Normally Perl buffers output to STDOUT, so things get printed only after enough "stuff" has accumulated. This is a performance improvement, but it can cause problems if output is written slowly or unusually. Try setting $| to a true value, which turns on autoflush for the currently selected output handle, and see what you get. Maybe you'll autoflush an errant byte?
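A minimal sketch of turning autoflush on, assuming STDOUT is still the selected handle (the default). Whether this changes what the console shows is only a guess.

```perl
use strict;
use warnings;

# Any true value in $| makes Perl flush the currently selected
# handle (STDOUT unless select() changed it) after every write.
$| = 1;

print "first chunk";       # flushed immediately, no newline needed
print " second chunk\n";
```

Setting $| back to 0 restores normal buffering.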

Cheers,
Ovid
