jethro has asked for the wisdom of the Perl Monks concerning the following question:
This was executed on Suse 10.0 and Ubuntu 6.06 with perl v5.8.7, completely 'utf8isized'.

> perl -e ' print "ä", uc "äm\n" '
ääm     <--- utf8, but no uppercase ä

> perl -e ' use utf8; print "ä", uc "äm\n" '
m     <--- iso8859-1 string. When converted with iconv, I get 'äÄ' !!

So obviously the pragma utf8 changes output to iso8859, on an utf8 system.
adding 'use locale;' changes nothing (and I checked that the locale was correct, de_DE.utf8), but 'use encoding utf8;' instead of 'use utf8;' works correctly.
After reading the perldocs, especially perlunicode, I'm still not sure whether one or both of these are bugs, but both are at least surprising behaviour.
Bug or not?
Re: bug in utf8 handling?
by graff (Chancellor) on Oct 04, 2006 at 06:13 UTC
Does that mean you were using a utf8-aware xterm window (uxterm, gnuterm, or some such)? If perl really prints utf8 data to a tty that isn't set up to "do the right thing" with utf8-encoded characters, there's no telling what the output might look like. The sort of problem you're reporting is bound to be some side issue, not perl itself -- e.g. locale settings, as suggested by tye, or the kind of display window you're using, etc. It could also be a misunderstanding about the circumstances that induce perl to print utf8-encoded characters through an output file handle.

I prefer to test these sorts of things with explicit code point values (I rarely try to put literal encoded characters into a script) and explicit encoding layers on the relevant file handle(s) (using either binmode or three-arg open). If you want to rely on "default behaviors", you need to experiment heavily to learn what those behaviors entail, and the experiments will need to include things like the shell environment, the display application, available fonts, and so on.

For the sake of confirming the behavior of the "uc" function on utf8 strings, I'd try it like this (with a utf8-capable terminal window). For me, that prints two lines: "aäm" followed by "AÄM". (In the absence of a utf8 display, I'd pipe the output to some other process that would "hexify" the byte stream, so that I could confirm it against a code chart.)
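graff's actual code sample did not survive extraction. The following is a hedged reconstruction of the kind of test he describes (an explicit code point plus an explicit encoding layer), not necessarily his original code:

```perl
#!/usr/bin/perl
# Hedged reconstruction -- graff's original snippet was lost.
# Explicit code point, explicit output encoding layer.
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';  # explicit layer instead of relying on -C

my $s = "a\x{E4}m";   # "aäm" built from an explicit code point
utf8::upgrade($s);    # force character (Unicode) semantics, even on 5.8
print "$s\n";         # aäm
print uc($s), "\n";   # AÄM
```

The utf8::upgrade call is what guarantees uc applies Unicode casing rules rather than the default a..z-only byte semantics.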
Re: bug in utf8 handling? (utfm)
by tye (Sage) on Oct 04, 2006 at 05:29 UTC
perldoc -f uc would tell you "Respects current LC_CTYPE locale if "use locale" in force". Add -Mlocale to your test cases and see if that "fixes" it. Perl doesn't uppercase anything other than a..z by default, regardless of any utf-8 concerns.

Update: Interesting. My testing confirmed my suspicion (which had been confirmed by the fine manual); however, further testing showed that uc (without -Mlocale) appears to not impact accented letters encoded in Latin-1 but does impact the exact same accented letters if they are encoded in utf-8. Perhaps the fine manual could use an update?

I also see now that you mention -Mlocale not having an impact. So I'll update my guess: Perl doesn't know that your string literals are meant to be utf-8 strings, and so it is interpreting them as byte strings (or Latin-1 strings, depending on what you want to call Perl's non-utf-8 strings). The "use utf8;" tells Perl that your string literals are utf-8, so it upcases correctly and also translates to Latin-1 when writing to STDOUT (since you haven't declared STDOUT as a utf-8 file handle).

encoding.pm notes that it declares STDOUT to be in a specific encoding. However, it also says that it interprets your source code as being in that encoding, not in utf-8. So that leaves me wondering how your source can be correctly interpreted as utf-8 and correctly interpreted as iso8859. Which also makes me curious whether your test cases behave the same if saved to Perl script files instead of being typed on the command line.

I'm also used to holding a belief that Perl's parser reads a line at a time, and a pragma that changes how the source code is read won't have any effect until the next line. So whether or not you have a newline after something like "use utf8;" might make a difference, and since source code entered on the command line isn't read by perl, perhaps that makes a difference as well.

But mostly I'm tired (but not able to get to sleep), so some of my wondering will likely just leave me wondering why I wasn't wandering sooner.

Update: Aha! Your source code is only using single-byte characters, yet bytes with the msb set get interpreted differently when those pragmas are used? Actually, I'd expect such "8-bit" characters to cause an error or warning when read after "use utf8;". You could assign your string literals to variables so that you could determine (in each case) whether Perl interpreted the string as utf-8, in order to eliminate some guessing.

- tye
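The asymmetry tye observed can be checked directly. This sketch (mine, not tye's) builds the same accented letter once as a Latin-1 byte string and once as a utf8-flagged character string, then applies uc to both:

```perl
# uc leaves an accented letter alone in a byte string, but uppercases
# the very same letter in a utf8-flagged (character) string.
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE4m";                      # Latin-1 "äm" as a plain byte string
my $chars = decode('ISO-8859-1', $bytes); # the same text as a character string

print uc($bytes) eq "\xE4M"   ? "bytes: \\xE4 untouched\n" : "bytes: changed\n";
print uc($chars) eq "\x{C4}M" ? "chars: \\x{E4} upcased\n" : "chars: not upcased\n";
```

Only the 'm' is uppercased in the byte string; in the character string the 'ä' becomes 'Ä' as well, which is exactly the behaviour tye describes.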
Re: bug in utf8 handling?
by jethro (Monsignor) on Oct 04, 2006 at 15:08 UTC
@tye: The test cases are condensed versions of a script I'm writing, so the same happens in script form (with 'use utf8' on its own line). In the script the strings were also first assigned to variables, but I didn't check them with unpack, if that is what you meant. I also piped into files and then checked the content of the files.

@graff: I got exactly the same output as you did. I didn't know the -C flag. Here are the results of further experiments (c3 a4 is the utf8 codepoint of ä, e4 is latin1 ä):

This shows that the internal representation is in iso and it expects iso input and output. The 'ä' in the script is therefore not recognised as an 'ä' but as two iso chars, and consequently can't get uppercased.

Now when I tried the same with 'use utf8;' (to cut a long story short), I found out that it really only changes the interpretation of a literal 'ä' in the script, without changing the internal representation or any IO. Which means 'use utf8' works correctly, but somewhere there's a documentation and installation deficiency. It should be documented that 'use utf8;' should not be used on utf8 machines without the additional switch -CS. Furthermore, on utf8 machines -CS should be enabled by default; otherwise scripts written on iso machines break on utf8 machines and vice versa. I don't fancy changing all my scripts to include the -CS in the first line.

Another thought: if 'use encoding utf8' changes IO formats like the -CS switch, using it would break backward compatibility with iso machines, which is not desirable. Ideally the perl interpreter should know where it is running and handle the script accordingly. Which brings us back to locale, which sadly seems to be ignored at the moment.
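jethro's experiment listing was lost in extraction. A sketch of the kind of check his conclusions rest on, making the actual bytes visible (my reconstruction, not his listing):

```perl
# Without 'use utf8', a UTF-8 encoded 'ä' in the source is two separate
# one-byte characters (c3 a4), so uc has nothing in a..z to work on.
use strict;
use warnings;

sub hexify { join ' ', map { sprintf '%02x', ord } split //, shift }

my $one_latin1_char = "\xE4";      # a single latin1 ä character
my $two_utf8_bytes  = "\xC3\xA4";  # an 'ä' typed in a utf8 editor, as perl sees it

print hexify($one_latin1_char), "\n";  # e4
print hexify($two_utf8_bytes),  "\n";  # c3 a4
print hexify(uc $two_utf8_bytes), "\n"; # c3 a4 -- unchanged, neither byte is a..z
```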
by Hue-Bond (Priest) on Oct 04, 2006 at 16:06 UTC
"c3 a4 is the utf8 codepoint of ä"

No. Codepoints are numbers. c3 a4 is the UTF-8 representation of codepoint 00E4:
Or, in a more legible form:
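The snippet that followed here was lost. A hedged reconstruction of the point being made, that a codepoint is a number while its UTF-8 form is a byte sequence:

```perl
# Codepoint (a number) versus its UTF-8 representation (bytes).
use strict;
use warnings;
use Encode qw(encode);

my $cp     = 0x00E4;                    # the codepoint of ä: a number
my $octets = encode('UTF-8', chr $cp);  # its UTF-8 representation: bytes
printf "U+%04X -> %s\n", $cp,
    join ' ', map { sprintf '%02x', ord } split //, $octets;
# prints: U+00E4 -> c3 a4
```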
"This shows that the internal representation is in iso"

You should not assume anything about the internal representation of perl strings; it may change in the future. It surprises me that no one has suggested Encode yet. With it, you can decode strings to Perl's internal format, mangle them at your will, and encode them back when printing them out:
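The original code sample was lost here; this is a minimal sketch of the decode/mangle/encode round trip being described:

```perl
# Decode on the way in, work with characters, encode on the way out.
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC3\xA4m";               # UTF-8 bytes arriving from outside
my $str    = decode('UTF-8', $octets);  # now a 2-character string: ä m
$str       = uc $str;                   # mangle at will: ÄM
print encode('UTF-8', $str);            # back to UTF-8 bytes for output
```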
"Furthermore on utf8 machines -CS should be enabled by default"

I thought that too, but it ended up being a bad idea. Yes, it's great for UTF-8 encoded text files, but what if you're working with a binary? Instead of using binmode :raw on binaries, I chose to drop -C and use binmode :utf8 on UTF-8 text files, like the rest of the world. And, if you haven't noticed yet, there's no mention of use utf8 in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely:
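The quoted material was lost here. In short, use utf8 only declares the encoding of the source file itself; what follows is my illustration of that, not Hue-Bond's lost quote:

```perl
# 'use utf8' says: this source file is UTF-8. It says nothing about IO;
# the output layer is a separate, independent decision.
use strict;
use warnings;
use utf8;                            # literals below are UTF-8 in the source
binmode STDOUT, ':encoding(UTF-8)';  # separately chosen output encoding

my $s = "äm";
print length($s), "\n";  # 2 -- one character for ä, not two bytes
print uc($s), "\n";      # ÄM
```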
by jethro (Monsignor) on Oct 04, 2006 at 17:03 UTC
The Encode module is nice, but if I use it, I have to put encode in all the places where I use a literal string or print something out (the program I'm working on is 7000 lines at the moment). And, more importantly, I hardcode every detail of my environment into my script. If I do that, it should at least be in only one place.

binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script. But I see your point about binary files.

About your last point, about 'use utf8;': sure, that may be its purpose, but it also changes the meaning of string literals in the script. If I want to use 'ä' in a string literal and have it interpreted/manipulated correctly, I may have to use the pragma. Or use \x or \N anywhere I have an Umlaut, which I guess would be quite painful.

But thanks a lot, your message was most informative. One thing I didn't understand was your comment about dropping -c on utf8 text files: did you mean -C or something else?
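An editorial aside, not from the thread: the "only in one place" configuration jethro asks for exists as the open pragma, which sets default encoding layers for the standard handles and for every open() in its lexical scope:

```perl
# One-place encoding configuration via the open pragma (my suggestion,
# not something proposed in the thread).
use strict;
use warnings;
use utf8;                             # source literals are UTF-8
use open qw(:std :encoding(UTF-8));   # STDIN/STDOUT/STDERR and open() defaults

my $s = "äm";
print uc($s), "\n";                   # ÄM, correctly encoded on the way out
```

This avoids sprinkling binmode calls through a 7000-line script, though binaries still need an explicit :raw layer.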
by jethro (Monsignor) on Oct 04, 2006 at 18:24 UTC
by graff (Chancellor) on Oct 16, 2006 at 09:02 UTC
A reply falls below the community's threshold of quality. You may see it by logging in.