Re^3: bug in utf8 handling?

Well, the internal representation was important for finding out what perl was doing and whether it was doing the right thing. But you are right that C3 A4 is not a code point; it is the UTF-8 byte sequence for 'ä' (U+00E4).

The Encode module is nice, but if I use it, I have to call encode everywhere I use a literal string or print something out (the program I'm working on is 7000 lines at the moment). And more importantly, I hardcode every detail of my environment in my script. If I do that, it should at least be in only one place.

Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script. But I see your point with binary files.
Essentially, I will have to create a subroutine that finds out whether I'm on a utf8 machine and reads text files accordingly.
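
A minimal sketch of such a subroutine, assuming that checking the usual POSIX locale variables is good enough for my purposes. The name guess_text_layer and the choice to return undef for non-utf8 locales are my own assumptions, not anything perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: guess an I/O layer from the POSIX locale variables.
# Takes a hash ref (defaults to \%ENV) so it can be tried out without
# touching the real environment.
sub guess_text_layer {
    my ($env) = @_;
    $env ||= \%ENV;

    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? ':encoding(UTF-8)' : undef;
    }
    return undef;    # nothing set: the caller has to pick a fallback
}

# Apply the guess in exactly one place.
my $layer = guess_text_layer();
binmode( STDOUT, $layer ) if defined $layer;
```

This only looks at variable names like de_DE.UTF-8; a locale that implies utf8 without saying so in its name would be missed.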

About your last point about 'use utf8;': Sure, that may be its purpose, but it also changes the meaning of string literals in the script. If I want to use 'ä' in a string literal and have it interpreted/manipulated correctly, I may have to use the pragma. Or use \x or \N anywhere I have an umlaut, which is quite painful, I guess.
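
To make the three spellings concrete, here is a small sketch. It assumes the script file itself is saved as UTF-8; the variable names are only for illustration:

```perl
use strict;
use warnings;
use utf8;                 # the source file is assumed to be saved as UTF-8
use charnames ':full';    # needed for \N{...} on perls older than 5.16

# Three ways to write the same one-character string (U+00E4).
my $literal = 'ä';
my $by_hex  = "\x{e4}";
my $by_name = "\N{LATIN SMALL LETTER A WITH DIAERESIS}";

# With 'use utf8' the literal is a single character, so it compares
# equal to the escaped forms and length() counts characters, not bytes.
die "literal is not one character" unless length($literal) == 1;
die "literal differs from \\x form" unless $literal eq $by_hex;
die "\\x form differs from \\N form" unless $by_hex eq $by_name;
```

Without the pragma, the literal would be the two bytes C3 A4 and the first two checks would fail.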

But thanks a lot, your message was most informative. What I didn't understand was your comment about dropping -c on utf8 text files. Did you mean -C or something else?

Re^4: bug in utf8 handling? by jethro (Monsignor) on Oct 04, 2006 at 18:24 UTC
I just read perldoc perlfunc on binmode:

`In other words: regardless of platform, use binmode() on binary data, like for example images.`

Since binary data is the exception (most perl scripts use strings, I guess), this would be sensible, and -C could then be made the default on utf8 machines. But the downside is that nobody expects the mangling of binary files as default (and the Spanish Inquisition).
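
What the perlfunc advice looks like in practice, as a rough sketch. The helper name slurp_binary is made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes.
sub slurp_binary {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    binmode($fh);    # no newline translation, no encoding layer
    local $/;        # slurp mode: read everything in one go
    my $data = <$fh>;
    close($fh);
    return $data;
}
```

With binmode in place, the bytes come back exactly as stored, whatever default layer the platform or a switch like -C would otherwise apply to the handle.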
Re^4: bug in utf8 handling? by graff (Chancellor) on Oct 16, 2006 at 09:02 UTC
> Binmode looks promising besides using -C, but both have the disadvantage of hardcoding the machine/platform into the script.

Actually, binmode is definitely the preferred method, as well as 3-arg open on file handles. There are some problems with -C, and this option is likely to get phased out in the future.

You can easily make the encoding a configurable parameter, to be set just once and used consistently throughout the app. Depending on how you've written the app so far, you might just need to convert your "open" statements to use the 3-arg format:

```perl
# during initialization:
my $encoding = ":utf8";       # or ":encoding(cp1252)" or whatever
binmode STDOUT, $encoding;    # (if this is appropriate)
binmode STDIN,  $encoding;
# ...
# then make all open statements look like this:
# open( INHANDLE,  "<$encoding", $ifilename )
# open( OUTHANDLE, ">$encoding", $ofilename )
```

(update: added a second open() example to make a point: this way, perl will always be dealing with unicode character strings, so that "." always matches one character, "uc" does the right thing, etc.)

There's also the "use open" pragma, although I can't seem to get it to work for output file handles. (It works great for setting the encoding mode on input, especially if you use the magical ARGV file handle.)

> But I see your point with binary files.

Yes, there really were a lot of people (especially on Red Hat systems with Perl 5.8.0, as it turned out) with a lot of perl scripts that handled binary data and assumed the "text/binary" file-mode distinction was not an issue for them ("just open the file..."). And then suddenly, when a file handle's encoding mode was set by default to be consistent with the user's locale (which by default was utf8), all hell broke loose. That sort of default behavior has been discontinued (corrected), and those people with those old scripts are still out there, blissfully ignoring how some other people would like utf8 to be the default file mode.

These are hard times for setting up default behaviors...
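
To illustrate the "set it once" idea and the point about "." matching one character, here is a small round-trip sketch. The names $ENCODING, write_text, read_text and read_bytes are hypothetical, and ":encoding(UTF-8)" is just one possible setting:

```perl
use strict;
use warnings;

# Hypothetical single point of configuration for the whole app.
my $ENCODING = ':encoding(UTF-8)';

# Write a character string through the configured layer.
sub write_text {
    my ( $path, $text ) = @_;
    open( my $out, ">$ENCODING", $path ) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";
}

# Read the file back as characters, decoded by the configured layer.
sub read_text {
    my ($path) = @_;
    open( my $in, "<$ENCODING", $path ) or die "Cannot read $path: $!";
    local $/;
    my $text = <$in>;
    close($in);
    return $text;
}

# Read the same file as undecoded bytes, for comparison.
sub read_bytes {
    my ($path) = @_;
    open( my $in, '<:raw', $path ) or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$in>;
    close($in);
    return $bytes;
}
```

Writing "\x{e4}" with write_text and reading it back with read_text should give a string of length 1 that matches /^.$/, while read_bytes on the same file should return the two bytes C3 A4.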