in reply to Character Encoding and Windows Console woes

Have you tried perl -C ...? (Did it help?)

You can enable automatic UTF-8-ification of your standard file handles, default open() layer, and @ARGV by using either the -C command line switch or the PERL_UNICODE environment variable; see the perlrun manpage for the documentation of the -C switch.
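As a rough in-script counterpart to what -C is meant to do for output handles, the sketch below applies a :utf8 layer explicitly and inspects the bytes that come out. It writes to an in-memory filehandle purely so the result can be checked; that choice, and the names $buffer, $fh and $hex, are illustrative assumptions rather than anything from the original post.

```perl
use strict;
use warnings;

# Write U+0083 through a :utf8 layer into an in-memory buffer so the
# resulting bytes can be inspected afterwards.
my $buffer = '';
open(my $fh, '>:utf8', \$buffer) or die "cannot open in-memory handle: $!";
print {$fh} "\x{83}";
close($fh) or die "close failed: $!";

# Render each output byte as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";    # prints "C2 83", the two-byte UTF-8 form of U+0083
```

For the real standard output handle, binmode(STDOUT, ':utf8') applies the same layer without relying on the command line switch.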

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Timing (and a little luck) are everything!

Re: Re: Character Encoding and Windows Console woes
by John M. Dlugosz (Monsignor) on Feb 16, 2004 at 21:37 UTC
    perl -C -Mutf8 -e"print qq(\x{83})" >d.txt
    The output file contains only one byte, 0x83; in UTF-8 it should have been two bytes. When printed to the console (without redirecting output), it showed up as one character in the OEM character set.
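One way to check what actually landed in a file like d.txt is to read it back in raw mode and dump the bytes. The following is only a sketch: file_bytes_hex is a hypothetical helper, and a temporary file stands in for d.txt so the one-byte and two-byte cases can be compared side by side.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return a file's bytes as space-separated hex.
sub file_bytes_hex {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "cannot read $path: $!";
    local $/;                       # slurp the whole file
    my $bytes = <$in>;
    close($in);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# A temporary file stands in for d.txt.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);

# Without an encoding layer, U+0083 goes out as the single byte 0x83.
open(my $out, '>:raw', $tmp_path) or die "cannot write $tmp_path: $!";
print {$out} "\x{83}";
close($out);
my $raw_hex = file_bytes_hex($tmp_path);

# With a :utf8 layer, the same character is written as two bytes.
open($out, '>:raw:utf8', $tmp_path) or die "cannot write $tmp_path: $!";
print {$out} "\x{83}";
close($out);
my $utf8_hex = file_bytes_hex($tmp_path);

print "no layer: $raw_hex\n";
print ":utf8:    $utf8_hex\n";
```

Calling file_bytes_hex('d.txt') after running the one-liner would show which of the two cases applies.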

    The docs I have say that -C enables wide system calls (See ${^WIDE_SYSTEM_CALLS} in the perlvar manpage.)

      This changed in 5.8.1 (from perlrun):

      -C <number/list>

      The -C flag controls some Unicode of the Perl Unicode features. <<Their typo not mine>>

      ~~snip~~

      (In Perls earlier than 5.8.1 the -C switch was a Win32-only switch that enabled the use of Unicode-aware "wide system call" Win32 APIs. This feature was practically unused, however, and the command line switch was therefore "recycled".)


        Thanks for pointing that out. The HTML docs on one machine must not have been updated properly, since it's Perl 5.8.2. (I noticed that Win32::API is missing from the index pane even though the module is present, so this is the second anomaly today.) On another machine, I see different -C documentation.

      In perlunicode:
      Unicode characters can also be added to a string by using the \x{...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is \x{263A}. This encoding scheme only works for characters with a code of 0x100 or above.
      Something for backward compatibility, I think.
        If at least one character in the string has a code of >= 0x100, then all characters >0x7F will be multi-byte encoded. If all the characters are less than 256, then it is also possible to encode the string with one byte per character. Some functions, like chr, make it a point to use the byte form when possible, since it's faster.
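The difference between the one-byte and multi-byte internal forms can be observed from within Perl. This is a minimal sketch, assuming Perl 5.8.1 or later (for utf8::is_utf8); the variable names are illustrative only.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

my $narrow = "\x{83}";          # code point below 0x100
my $wide   = "\x{263A}";        # code point 0x100 or above (the smiley)
my $mixed  = $narrow . $wide;   # joining them forces the multi-byte form

# utf8::is_utf8 reports whether a string is stored in the multi-byte form.
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;
my $mixed_flag  = utf8::is_utf8($mixed)  ? 1 : 0;

# length counts characters; bytes::length counts bytes of internal storage.
my $narrow_bytes = bytes::length($narrow);
my $mixed_chars  = length($mixed);
my $mixed_bytes  = bytes::length($mixed);

printf "narrow: multi-byte=%d bytes=%d\n", $narrow_flag, $narrow_bytes;
printf "wide:   multi-byte=%d\n", $wide_flag;
printf "mixed:  multi-byte=%d chars=%d bytes=%d\n",
    $mixed_flag, $mixed_chars, $mixed_bytes;
```

On its own, "\x{83}" stays in the one-byte form. Once it is joined with the smiley, the combined string holds two characters in five bytes: two for U+0083 and three for U+263A.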