in reply to bug in utf8 handling?
@tye: The test cases are condensed versions of a script I'm writing, so the same happens in script form (with 'use utf8' on its own line). In the script the strings were also first assigned to variables. But I didn't check them with unpack, if that's what you meant. I also piped the output into files and then checked the content of the files.
@graff: I got exactly the same output as you did. I didn't know the -C flag. Here are the results of further experiments (c3 a4 is the UTF-8 encoding of ä, e4 is the Latin-1 ä):
    echo ä | od -t x1
    c3 a4 0a    <--- utf

    perl -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
    ä     <--- my input to the <>
    c3 a4 0a 20 e4 20 c3 a4 20 c3 a4 0a    <-- utf iso utf utf

    perl -CI -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
    ä
    e4 0a 20 e4 20 c3 a4 20 c3 a4 0a    <-- iso iso utf utf

    perl -CO -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
    ä
    c3 83 c2 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a    <-- utfgarbage utf utfgarbage utfgarbage

    perl -CS -e '$f=<>; print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
    c3 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a    <-- utf utf utfgarbage utfgarbage

This shows that the internal representation is in iso and it expects iso input and output. The 'ä' in the script is therefore not recognised as an 'ä' but as two iso chars and consequently can't get uppercased.
Now when I tried the same with 'use utf8;' (to cut a long story short), I found out that it really only changes interpretation of literal script 'ä' without changing internal representation or any IO.
Which means 'use utf8' works correctly, but somewhere there's a documentation and installation deficiency.
It should be documented that 'use utf8;' should not be used on utf8 machines without an additional switch -CS.
Furthermore on utf8 machines -CS should be enabled by default. Otherwise scripts written on iso machines break on utf8 machines and vice versa. I don't fancy changing all my scripts to include the -CS in the first line.
Another thought: If 'use encoding utf8' changes IO formats like the -CS switch does, using it would break backward compatibility with iso machines, which is not that desirable. Ideally the perl interpreter should know where it is running and handle the script accordingly. Which brings us back to locale, which sadly seems to be ignored at the moment.
Re^2: bug in utf8 handling?
by Hue-Bond (Priest) on Oct 04, 2006 at 16:06 UTC
by jethro (Monsignor) on Oct 04, 2006 at 17:03 UTC
by jethro (Monsignor) on Oct 04, 2006 at 18:24 UTC
by graff (Chancellor) on Oct 16, 2006 at 09:02 UTC