In reply to: bug in utf8 handling?
Does that mean you were using a utf8-aware xterm window (uxterm, gnuterm, or some such)? If perl really prints utf8 data to a tty that isn't set up to "do the right thing" with utf8 encoded characters, there's no telling what the output might look like.
The sort of problem you're reporting is bound to be some side issue, not perl itself -- e.g. locale settings, as suggested by tye, or the kind of display window you're using, etc. It could also be a misunderstanding about the circumstances that induce perl to print utf8-encoded characters through an output file handle.
I prefer to test these sorts of things with explicit code point values (I rarely try to put literal encoded characters into a script) and explicit encoding layers on the relevant file handle(s) (using either binmode or three-arg open).
If you want to rely on "default behaviors", you do need to experiment heavily on what those behaviors entail, and the experiments will need to include things like the shell environment, the display application, available fonts, ...
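A starting point for those experiments (assuming a POSIX-ish system) is simply to look at what the shell is telling every program about the locale:

```shell
# The effective LC_* settings this shell (and anything it launches) will see:
locale
# The usual source of the default locale:
echo "$LANG"
```

If `locale` doesn't show a UTF-8 locale (e.g. something ending in `.UTF-8`), the "default behaviors" of both perl and the terminal are unlikely to be what you expect.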
For the sake of confirming the behavior of the "uc" function on utf8 strings, I'd try it like this (with a utf8-capable terminal window):

    perl -CS -e 'print "a\xe4m\n"' | perl -CS -pe 'print; $_=uc'

For me, that prints two lines: "aäm" followed by "AÄM" (which I am posting here as literal characters).
(In the absence of a utf8 display, I'd pipe the output to some other process that would "hexify" the byte stream, so that I could confirm it against a code chart.)