
(c3 a4 is the utf8 codepoint of ä

No. Codepoints are numbers. c3 a4 is the UTF8 representation of codepoint 00E4:

$ perl -le'binmode STDOUT, ":utf8"; print "\x{00E4}";'|od -c
0000000 303 244 \n
0000003

Or, in a more legible form:

$ perl -CO -le'use charnames ":full"; print "\N{LATIN SMALL LETTER A WITH DIAERESIS}";'|od -c
0000000 303 244 \n
0000003

This shows that the internal representation is in iso

You should not assume anything about the internal representation of perl strings. It may change in the future.

It surprises me that no one has suggested Encode yet. With it, you can decode strings to Perl's internal format, mangle them at will, and encode them back when printing them out:

$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);  ## further mangling
print encode "latin1", $c;
__END__
0000000 305
0000001
$ perl |od -c
use Encode;
my $c = decode "latin1", "\xe4";
$c = uc $c;
$c = chr (1 + ord $c);
print encode "utf8", $c;  ## <-- change here
__END__
0000000 303 205
0000002

Furthermore on utf8 machines -CS should be enabled by default

I thought so too, but it ended up being a bad idea. Yes, it's great for UTF-8 encoded text files, but what if you're working with a binary? Instead of using binmode :raw on binaries, I chose to drop -C and use binmode :utf8 on UTF-8 text files, like the rest of the world.
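
To make the per-handle approach concrete, here is a minimal sketch. It assumes throwaway temporary files created with File::Temp (not part of the original discussion): one handle gets the :utf8 layer, the other is left as :raw.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Text handle: characters are encoded as UTF-8 on the way out.
my ($tfh, $text_file) = tempfile(UNLINK => 1);
binmode $tfh, ':utf8';
print {$tfh} "\x{00E4}";            # one character, two bytes on disk
close $tfh or die "close: $!";

# Binary handle: bytes pass through untouched.
my ($bfh, $bin_file) = tempfile(UNLINK => 1);
binmode $bfh, ':raw';
print {$bfh} "\xe4";                # one byte on disk
close $bfh or die "close: $!";

my $text_size = -s $text_file;
my $bin_size  = -s $bin_file;
print "text file: $text_size bytes, binary file: $bin_size bytes\n";
```

The point of the sketch is only that the layer is chosen per handle, so binary data never passes through a UTF-8 layer by accident.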

And, if you've not noticed yet, there's no mention of use utf8 in this post (well, almost ;^)). AIUI, utf8 serves a totally different purpose, namely:

use utf8;
my $á = 42;
print $á, "\n";
__END__
42

--
David Serrano

Re^3: bug in utf8 handling?
by jethro (Monsignor) on Oct 04, 2006 at 17:03 UTC
    Well, the internal representation was important to find out what perl was doing and whether it was doing the right thing. But you are right with C3 A4 not being a codepoint.

    The Encode module is nice, but if I use that, I have to put encode in all places where I use a literal string or print something out (the program I'm working on is 7000 lines at the moment). And more importantly, I hardcode every detail of my environment in my script. If I do that, it should at least be only in one place.

    Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script. But I see your point with binary files.
    Essentially I will have to create a subroutine that finds out whether I'm on a utf8 machine and reads text files accordingly.
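
    Such a subroutine might look roughly like the sketch below. The name text_layer_for_env is hypothetical, the check only looks at the usual locale environment variables (LC_ALL, LC_CTYPE, LANG, in that order), and the latin1 fallback is an assumption rather than anything the thread prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: pick an I/O layer for text files from the locale
# environment variables, checked in the usual precedence order.
sub text_layer_for_env {
    my %env = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$var};
        next unless defined $value && length $value;
        # Locale names look like "de_DE.UTF-8" or "en_US.utf8".
        return $value =~ /\.utf-?8\b/i
            ? ':encoding(UTF-8)'
            : ':encoding(iso-8859-1)';
    }
    return ':encoding(iso-8859-1)';   # assumed fallback when nothing is set
}

print text_layer_for_env(%ENV), "\n";
```

    The returned string can be used with binmode or appended to the mode of a 3-arg open, which keeps the platform decision in one place. Asking the C library through I18N::Langinfo (langinfo(CODESET)) would be an alternative to parsing the environment by hand.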

    About your last point about 'use utf8;': Sure, that may be its purpose, but it also changes the meaning of string literals in the script. If I want to use 'ä' in a string literal and have it interpreted/manipulated correctly, I may have to use the pragma. Or use \x or \N everywhere I have an umlaut, which is quite painful, I guess.
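
    The effect on literals can be seen in a small sketch like the following. It assumes the script file itself is saved as UTF-8; with any other source encoding the byte counts would differ.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
my $bytes = do { no utf8;  "ä" };   # literal taken as the raw bytes c3 a4
my $chars = do { use utf8; "ä" };   # literal taken as one character, U+00E4

printf "without utf8: length %d\n", length $bytes;
printf "with utf8:    length %d, uc is U+%04X\n", length $chars, ord uc $chars;
```

    Only the literal read under the pragma behaves as a single character for length, uc and regex matching.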

    But thanks a lot, your message was most informative. What I didn't understand was your comment about dropping -c on utf8 text files. Did you mean -C or something else?

      I just read perldoc perlfunc on binmode:
      In other words: regardless of platform, use binmode() on binary data, like for example images.
      Since binary data is the exception (most perl scripts use strings, I guess), this would be sensible, and -C could then be made the default on utf8 machines. But the downside is, nobody expects the mangling of binary files as default (and the Spanish Inquisition).

      Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script.

      Actually, binmode is definitely the preferred method, as well as 3-arg open on file handles. There are some problems with -C, and this option is likely to get phased out in the future.

      You can easily make the encoding a configurable parameter, to be set just once and used consistently throughout the app. Depending on how you've written the app so far, you might just need to convert your "open" statements to use the 3-arg format:

      # during initialization:
      $encoding = ":utf8";  # or ":encoding(cp1252)" or whatever
      binmode STDOUT, $encoding;  # (if this is appropriate)
      binmode STDIN, $encoding;
      # ...
      # then make all open statements look like this:
      # open( INHANDLE, "<$encoding", $ifilename )
      # open( OUTHANDLE, ">$encoding", $ofilename )
      (update: added a second open() example to make a point: this way, perl will always be dealing with unicode character strings, so that "." always matches one character, "uc" does the right thing, etc.)
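
      As a runnable illustration of that point, here is a small round trip. It is a sketch that assumes a temporary file from File::Temp and ":encoding(UTF-8)" as the configured layer; neither comes from the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $encoding = ':encoding(UTF-8)';   # set once; assumed value for this sketch

my (undef, $filename) = tempfile(UNLINK => 1);

open(my $out, ">$encoding", $filename) or die "open for write: $!";
print {$out} "\x{00E4}";             # one character, encoded on the way out
close $out or die "close: $!";

open(my $in, "<$encoding", $filename) or die "open for read: $!";
my $text = <$in>;                    # decoded on the way in
close $in;

my $matched = $text =~ /^.$/ ? 1 : 0;            # "." sees a single character
my $upper   = sprintf 'U+%04X', ord uc $text;    # uc works on the character
print "length ", length($text), ", one-char match $matched, uc $upper\n";
```

      Changing $encoding in one place is enough to switch every handle opened this way to another encoding.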

      There's also the "use open" pragma, although I can't seem to get it to work for output file handles. (Works great for setting encoding mode on input -- esp. if you use the magical ARGV file handle.)

      But I see your point with binary files.

      Yes, there really were a lot of people (esp. on Red Hat systems with Perl 5.8.0, as it turned out), with a lot of perl scripts that handled binary data and assumed the "text/binary" file-mode distinction was not an issue for them ("just open the file..."). And then suddenly, when a file handle's encoding mode was set by default to be consistent with the user's locale (which by default was utf8), all hell broke loose.

      That sort of default behavior has been discontinued (corrected), and those people with those old scripts are still out there, blissfully ignoring how some other people would like utf8 to be the default file mode. These are hard times for setting up default behaviors...