bug in utf8 handling?

jethro has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: bug in utf8 handling? by graff (Chancellor) on Oct 04, 2006 at 06:13 UTC
This was executed on Suse 10.0 and Ubuntu 6.06 with perl v5.8.7, completely 'utf8isized'. Does that mean you were using a utf8-aware xterm window (uxterm, gnuterm, or some such)? If perl really prints utf8 data to a tty that isn't set up to "do the right thing" with utf8 encoded characters, there's no telling what the output might look like. The sort of problem you're reporting is bound to be some side issue, not perl itself -- e.g. locale settings, as suggested by tye, or the kind of display window you're using, etc. It could also be a misunderstanding about the circumstances that induce perl to print utf8-encoded characters through an output file handle. I prefer to test these sorts of things with explicit code point values (I rarely try to put literal encoded characters into a script) and explicit encoding layers on the relevant file handle(s) (using either binmode or three-arg open). If you want to rely on "default behaviors", you do need to experiment heavily on what those behaviors entail, and the experiments will need to include things like the shell environment, the display application, available fonts, ... For the sake of confirming the behavior of the "uc" function on utf8 strings, I'd try it like this (with a utf8 capable terminal window): `perl -CS -e 'print "a\xe4m\n"' \| perl -CS -pe 'print; $_=uc'` [download] For me, that prints two lines: "aäm" followed by "AÄM" (which I am posting here as ~~utf8~~ iso-8859-1 -- if you don't see exactly three letters in each string, with two dots over the middle one, set your browser to use ~~utf8~~ that). (In the absence of a utf8 display, I'd pipe the output to some other process that would "hexify" the byte stream, so that I could confirm it against a code chart.)	[reply] [d/l]
Re: bug in utf8 handling? (utfm) by tye (Sage) on Oct 04, 2006 at 05:29 UTC
`perldoc -f uc` would tell you "Respects current LC_CTYPE locale if "use locale" in force". Add `-Mlocale` to your test cases and see if that "fixes" it. Perl doesn't uppercase other than a..z by default, regardless of any utf-8 concerns. Update: Interesting. My testing confirmed my suspicion (which had been confirmed by the fine manual), however, further testing showed that uc (without -Mlocale) appears to not impact accented letters encoded in Latin-1 but does impact the exact same accented letters if they are encoded in utf-8. Perhaps the fine manual could use an update? I also see now that you mention -Mlocale not having an impact. So I'll update my guess to be that Perl doesn't know that your string literals are meant to be utf-8 strings and so it is interpretting them as byte strings (or Latin-1 strings, depending on what you want to call Perl's non-utf-8 strings). The "use utf8;" tells Perl that your string literals are utf-8 and so it upcases correctly and also translates to Latin-1 when writing to STDOUT (since you haven't declared STDOUT as being a utf-8 file handle). encoding.pm notes that it declares STDOUT to be in a specific encoding. However, it also says that it interprets your source code as being in that encoding, not in utf-8. So that leaves me wondering how your source can be correctly interpretted as utf-8 and correctly interpretted as iso8859. Which also makes me curious if your test cases behave the same if saved to Perl script files instead of being typed into the command line. I'm also used to holding a belief that Perl's parser reads lines at a time and a pragma that changes how the source code is read won't have any effect until the next line. So whether or not you have a newline after something like "use utf8;" might make a difference and since source code entered on the command line isn't read by perl, perhaps that makes a difference as well. But mostly I'm tired (but not able to get to sleep) so some of my wondering will likely just leave me wondering why I wasn't wandering sooner. Update: Aha! Your source code is only using single-byte characters however bytes with the msb set get interpretted differently when those pragmas are used? Actually, I'd expect such "8-bit" characters to cause an error or warning when read after "use utf8;". You could assign your string literals to variables so that you could determine (in each case) if Perl interpretted the string as utf-8, in order to eliminate some guessing. - tye	[reply]
Re: bug in utf8 handling? by jethro (Monsignor) on Oct 04, 2006 at 15:08 UTC
@benizi: Yes, my display is utf8, I get 0000000 303 244 \n as output. @tye: The test cases are condensed versions of a script I'm writing. So the same happens in script-form (with 'use utf8' on its own line). In the script the strings were also first assigned to variables. But I didn't check them with unpack if you meant that. I also piped into files and then checked the content of the file. @graff: I got exactly the same output as you did. Didn't know the -C flag. Here results of further experiments (c3 a4 is the utf8 codepoint of ä, e4 is latin1 ä): echo ä \| od -t x1 c3 a4 0a <--- utf perl -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' \| od -t x1 ä <--- my input to the <> c3 a4 0a 20 e4 20 c3 a4 20 c3 a4 0a <-- utf iso utf utf perl -CI -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' \| od -t x +1 ä e4 0a 20 e4 20 c3 a4 20 c3 a4 0a <-- iso iso utf utf perl -CO -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' \| od -t x +1 ä c3 83 c2 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a <-- utfgarba +ge utf utfgarbage utfgarbage perl -CS -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' \| od -t x +1 c3 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a <-- utf utf utfga +rbage utfgarbage [download] This shows that the internal representation is in iso and it expects iso input and output. The 'ä' in the script is therefore not recognised as an 'ä' but as two iso chars and consequently can't get uppercased. Now when I tried the same with 'use utf8;' (to cut a long story short), I found out that it really only changes interpretation of literal script 'ä' without changing internal representation or any IO. Which means 'use utf8' works correctly, but somewhere there's a documentation and installation deficiency. It should be documented that 'use utf8;' should not be used on utf8 machines without an additional switch -CS. Furthermore on utf8 machines -CS should be enabled by default. Otherwise scripts written on iso machines break on utf8 machines and vice versa. I don't fancy changing all my scripts to include the -CS in the first line. Another thought: If 'use encoding utf8' changes IO formats like the -CS switch, using it would break backward compatibility to iso machines, which is not that desirable. Ideally the perl interpreter should know where it is running and handle the script accordingly. Which brings us back to locale, which sadly seems to be ignored at the moment.	[reply] [d/l]
Re^2: bug in utf8 handling? by Hue-Bond (Priest) on Oct 04, 2006 at 16:06 UTC
(c3 a4 is the utf8 codepoint of Ã¤ No. Codepoints are numbers. `c3 a4` is the UTF8 representation of codepoint 00E4: `$ perl -le'binmode STDOUT, ":utf8"; print "\x{00E4}";'\|od -c 0000000 303 244 \n 0000003` [download] Or, in a more legible form: `$ perl -CO -le'use charnames ":full"; print "\N{LATIN SMALL LETTER A W +ITH DIAERESIS}";'\|od -c 0000000 303 244 \n 0000003` [download] This shows that the internal representation is in iso You should not assume anything about the internal representation of perl strings. It may change in the future. It surprises me than no one suggested Encode yet. With it, you can decode strings to Perl internal format, mangle them at your will and encode them back when printing them out: `$ perl \|od -c use Encode; my $c = decode "latin1", "\xe4"; $c = uc $c; $c = chr (1 + ord $c); ## further mangling print encode "latin1", $c; __END__ 0000000 305 0000001 $ perl \|od -c use Encode; my $c = decode "latin1", "\xe4"; $c = uc $c; $c = chr (1 + ord $c); print encode "utf8", $c; ## <-- change here __END__ 0000000 303 205 0000002` [download] Furthermore on utf8 machines -CS should be enabled by default I thought that too but it ended being a bad idea. Yes, great for UTF-8 encoded text files but, what if you're working with a binary? Instead of using binmode `:raw` on binaries, I chose to drop `-C` and binmode `:utf8` on UTF-8 text files, like the rest of the world. And, if you've not noticed yet, there's no mention of `use utf8` in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely: `use utf8; my $á = 42; print $á, "\n"; __END__ 42` [download] -- David Serrano	[reply] [d/l] [select]
Re^3: bug in utf8 handling? by jethro (Monsignor) on Oct 04, 2006 at 17:03 UTC
Well, the internal representation was important to find out what perl was doing and whether it was doing the right thing. But you are right with C3 A4 not being a codepoint. The Encode pragma is nice, but if I use that, I have to put encode in all places I use a literal string or print out something (the program I'm working on is 7000 lines at the moment). And more importantly I hardcode every detail of my environment in my script. If I do that, it should at least be only in one place. Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script. But I see your point with binary files. Essentially I will have to create a subroutine that finds out whether I'm on an utf8 machine and read text files accordingly. About your last point about 'use utf8;': Sure that may be its purpose, but it also changes the meaning of string literals in the script. If I want to use 'ä' in a string literal and have it interpreted/manipulated correctly, I may have to use the pragma. Or use \x or \N anywhere I have an Umlaut, which is quite painful I guess. But thanks alot, your message was most informative. What I didn't understand was your comment about dropping -c on utf8 text files? Did you mean -C or something else?	[reply]
Re^4: bug in utf8 handling? by jethro (Monsignor) on Oct 04, 2006 at 18:24 UTC
Re^4: bug in utf8 handling? by graff (Chancellor) on Oct 16, 2006 at 09:02 UTC
A reply falls below the community's threshold of quality. You may see it by logging in.