Re: bug in utf8 handling?

@benizi: Yes, my display is utf8, I get 0000000 303 244 \n as output.

@tye: The test cases are condensed versions of a script I'm writing. So the same happens in script-form (with 'use utf8' on its own line). In the script the strings were also first assigned to variables. But I didn't check them with unpack if you meant that. I also piped into files and then checked the content of the file.

@graff: I got exactly the same output as you did. Didn't know the -C flag. Here results of further experiments (c3 a4 is the utf8 codepoint of ä, e4 is latin1 ä):

echo ä | od -t x1
c3 a4 0a   <--- utf
perl -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
ä        <--- my input to the <>
c3 a4 0a 20 e4 20 c3 a4 20 c3 a4 0a  <-- utf iso utf utf
perl -CI -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
ä
e4 0a 20 e4 20 c3 a4 20 c3 a4 0a  <-- iso iso utf utf
perl -CO -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
ä
c3 83 c2 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a  <-- utfgarbage utf utfgarbage utfgarbage
perl -CS -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
c3 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a   <-- utf utf utfgarbage utfgarbage

This shows that the internal representation is in iso and it expects iso input and output. The 'ä' in the script is therefore not recognised as an 'ä' but as two iso chars and consequently can't get uppercased.
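A minimal sketch of where the "utfgarbage" bytes probably come from, assuming the output layer treats every undecoded byte as a Latin-1 character and encodes it again. The variable names are only illustrative, and `Encode::encode` stands in here for what the `-CO` layer does on output:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The two bytes of a UTF-8 "ä" as perl holds them when nothing decoded them.
my $undecoded = "\xc3\xa4";

# Encoding to UTF-8 treats each byte as one Latin-1 character,
# so both bytes get expanded to two bytes each.
my $double = encode('UTF-8', $undecoded);

# Show the result as space-separated hex bytes, like od -t x1 does.
my $hex = join ' ', unpack '(H2)*', $double;
print "$hex\n";    # prints: c3 83 c2 a4
```

That is the same four-byte sequence as in the -CO and -CS runs above.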

Now when I tried the same with 'use utf8;' (to cut a long story short), I found out that it really only changes interpretation of literal script 'ä' without changing internal representation or any IO.
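To illustrate that point without depending on how the script file itself is saved, here is a rough sketch that uses `utf8::decode` at run time as a stand-in for what 'use utf8' does to a literal at compile time. The variable names are made up for the example, and it assumes no locale or `unicode_strings` feature is in effect:

```perl
use strict;
use warnings;

# The two bytes a UTF-8 editor stores for a literal "ä" in the script.
my $as_bytes = "\xc3\xa4";

# Same bytes, but decoded into a single character, which is what
# a literal "ä" becomes under 'use utf8'.
my $as_chars = "\xc3\xa4";
utf8::decode($as_chars);

printf "undecoded: length %d, uc changed it: %s\n",
    length($as_bytes), (uc($as_bytes) eq $as_bytes ? 'no' : 'yes');
printf "decoded:   length %d, uc gives U+%04X\n",
    length($as_chars), ord(uc($as_chars));
```

Neither variant touches the layers on STDIN or STDOUT, which fits the observation that IO is unchanged.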

Which means 'use utf8' works correctly, but somewhere there's a documentation and installation deficiency.

It should be documented that 'use utf8;' should not be used on utf8 machines without an additional switch -CS.

Furthermore on utf8 machines -CS should be enabled by default. Otherwise scripts written on iso machines break on utf8 machines and vice versa. I don't fancy changing all my scripts to include the -CS in the first line.

Another thought: If 'use encoding utf8' changes IO formats like the -CS switch does, using it would break backward compatibility with iso machines, which is not that desirable. Ideally the perl interpreter should know where it is running and handle the script accordingly. Which brings us back to locale, which sadly seems to be ignored at the moment.

Re^2: bug in utf8 handling? by Hue-Bond (Priest) on Oct 04, 2006 at 16:06 UTC
> (c3 a4 is the utf8 codepoint of ä

No. Codepoints are numbers. `c3 a4` is the UTF8 representation of codepoint 00E4:

```
$ perl -le'binmode STDOUT, ":utf8"; print "\x{00E4}";' |od -c
0000000 303 244 \n
0000003
```

Or, in a more legible form:

```
$ perl -CO -le'use charnames ":full"; print "\N{LATIN SMALL LETTER A WITH DIAERESIS}";' |od -c
0000000 303 244 \n
0000003
```

> This shows that the internal representation is in iso

You should not assume anything about the internal representation of perl strings. It may change in the future.

It surprises me that no one suggested Encode yet. With it, you can decode strings to Perl internal format, mangle them at your will and encode them back when printing them out:

```
$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);   ## further mangling
print encode "latin1", $c;
__END__
0000000 305
0000001
$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);
print encode "utf8", $c;   ## <-- change here
__END__
0000000 303 205
0000002
```

> Furthermore on utf8 machines -CS should be enabled by default

I thought that too but it ended up being a bad idea. Yes, great for UTF-8 encoded text files but, what if you're working with a binary? Instead of using binmode `:raw` on binaries, I chose to drop `-C` and binmode `:utf8` on UTF-8 text files, like the rest of the world.

And, if you've not noticed yet, there's no mention of `use utf8` in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely:

```
use utf8;
my $á = 42;
print $á, "\n";
__END__
42
```

-- David Serrano
Re^3: bug in utf8 handling? by jethro (Monsignor) on Oct 04, 2006 at 17:03 UTC
Well, the internal representation was important to find out what perl was doing and whether it was doing the right thing. But you are right with C3 A4 not being a codepoint.

The Encode module is nice, but if I use that, I have to put encode in all places where I use a literal string or print out something (the program I'm working on is 7000 lines at the moment). And more importantly, I hardcode every detail of my environment in my script. If I do that, it should at least be only in one place.

Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script. But I see your point with binary files. Essentially I will have to create a subroutine that finds out whether I'm on a utf8 machine and read text files accordingly.

About your last point about 'use utf8;': Sure, that may be its purpose, but it also changes the meaning of string literals in the script. If I want to use 'ä' in a string literal and have it interpreted/manipulated correctly, I may have to use the pragma. Or use \x or \N anywhere I have an Umlaut, which is quite painful I guess.

But thanks a lot, your message was most informative. What I didn't understand was your comment about dropping -c on utf8 text files? Did you mean -C or something else?
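Such a subroutine could look roughly like the sketch below. It is only an assumption that looking at the usual POSIX locale variables (LC_ALL, then LC_CTYPE, then LANG) is good enough; the name `locale_wants_utf8` and the Latin-1 fallback are hypothetical choices for the example, not anything perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: guess from the locale variables whether the
# environment asks for UTF-8. The first variable that is set and
# non-empty decides, in the usual precedence order.
sub locale_wants_utf8 {
    my ($env) = @_;    # pass \%ENV, or a plain hash for testing
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;          # nothing set: assume a non-UTF-8 environment
}

# Decide the layer in exactly one place (Latin-1 fallback is an assumption).
my $layer = locale_wants_utf8(\%ENV)
    ? ':encoding(UTF-8)'
    : ':encoding(iso-8859-1)';

binmode STDOUT, $layer;
print "Using layer $layer\n";
```

This keeps the platform decision in a single spot, so the rest of the script only ever refers to `$layer`.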
Re^4: bug in utf8 handling? by jethro (Monsignor) on Oct 04, 2006 at 18:24 UTC
I just read perldoc perlfunc on binmode:

`In other words: regardless of platform, use binmode() on binary data, like for example images.`

Since binary data is the exception (most perl scripts use strings I guess), this would be sensible and -C could then be made default on utf8 machines. But the downside is, nobody expects the mangling of binary files as default (and the spanish inquisition).
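A small sketch of the mangling in question, assuming a Unix-like system (no CRLF translation). The helper name `bytes_written` is made up for the example; it writes the same 256 byte values once through `:raw` and once through `:utf8` and compares the file sizes:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $data through the given layer into a
# temporary file and return the resulting file size in bytes.
sub bytes_written {
    my ($layer, $data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, $layer) or die "binmode failed: $!";
    print {$fh} $data;
    close($fh) or die "close failed: $!";
    return -s $path;
}

# Every possible byte value once, standing in for an image or similar.
my $binary = pack 'C*', 0 .. 255;

my $raw_size  = bytes_written(':raw',  $binary);
my $utf8_size = bytes_written(':utf8', $binary);

printf "raw:  %d bytes\n", $raw_size;
printf "utf8: %d bytes\n", $utf8_size;
```

With `:raw` the file has the original size; with `:utf8` every byte above 0x7f is written as two bytes, so the file grows and is no longer the same binary.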
Re^4: bug in utf8 handling? by graff (Chancellor) on Oct 16, 2006 at 09:02 UTC
> Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script.

Actually, binmode is definitely the preferred method, as well as 3-arg open on file handles. There are some problems with -C, and this option is likely to get phased out in the future.

You can easily make the encoding a configurable parameter, to be set just once and used consistently throughout the app. Depending on how you've written the app so far, you might just need to convert your "open" statements to use the 3-arg format:

```
# during initialization:
$encoding = ":utf8";   # or ":encoding(cp1252)" or whatever
binmode STDOUT, $encoding;   # (if this is appropriate)
binmode STDIN, $encoding;
# ...
# then make all open statements look like this:
# open( INHANDLE, "<$encoding", $ifilename )
# open( OUTHANDLE, ">$encoding", $ofilename )
```

(update: added a second open() example to make a point: this way, perl will always be dealing with unicode character strings, so that "." always matches one character, "uc" does the right thing, etc.)

There's also the "use open" pragma, although I can't seem to get it to work for output file handles. (Works great for setting encoding mode on input -- esp. if you use the magical ARGV file handle.)

> But I see your point with binary files.

Yes, there really were a lot of people (esp. on Red Hat systems with Perl 5.8.0, as it turned out), with a lot of perl scripts that handled binary data and assumed the "text/binary" file-mode distinction was not an issue for them ("just open the file..."). And then suddenly, when a file handle's encoding mode was set by default to be consistent with the user's locale (which by default was utf8), all hell broke loose.

That sort of default behavior has been discontinued (corrected), and those people with those old scripts are still out there, blissfully ignoring how some other people would like utf8 to be the default file mode.

These are hard times for setting up default behaviors...
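As a rough, self-contained illustration of the "set it once" idea, the sketch below writes one character through a configurable layer with 3-arg open and reads it back. The temporary file and the choice of `:encoding(UTF-8)` are assumptions for the example, and the byte count assumes a Unix-like system:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Set once, use everywhere; the value is only an example.
my $encoding = ':encoding(UTF-8)';

my (undef, $path) = tempfile(UNLINK => 1);

# Write one character string through the encoding layer.
open(my $out, ">$encoding", $path) or die "open for write: $!";
print {$out} "\x{e4}\n";    # ä as a single character
close($out) or die "close: $!";

# Read it back through the same layer: perl sees characters again.
open(my $in, "<$encoding", $path) or die "open for read: $!";
my $line = <$in>;
close($in);
chomp $line;

my $size = -s $path;
printf "on disk: %d bytes, in perl: %d character, uc is U+%04X\n",
    $size, length($line), ord(uc($line));
```

On disk the ä takes two bytes plus the newline, while inside perl it is one character that `uc` handles as expected.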