John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

WinNT/2k/XP uses Unicode. The "console" window has Unicode versions and Code-Page versions of the various input and output functions. Perl supports Unicode. All should be well, right?

Well, first of all, the default STDOUT behavior seems to truncate characters to 8 bits and warn if the ordinal was >255, NOT do an encoding translation to the current character set.

Even if the "current" set of 256 characters (the Code Page) is enough and I don't worry about UTF-8, the results in the Console don't match what appears in the source. That's because the Windows Code Page is 1252, but for compatibility with old text-mode code from DOS, the Console uses the old DOS code page by default.
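The mismatch can be illustrated with the core Encode module. This is only a sketch: it assumes code page 437 as the DOS code page, whereas many systems use 850 or another one, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $e_acute = "\x{E9}";                      # U+00E9, e with acute accent

my $win_byte = encode('cp1252', $e_acute);   # byte as written by Windows tools
my $dos_byte = encode('cp437',  $e_acute);   # byte the DOS code page expects

printf "cp1252: %02X\n", ord $win_byte;      # E9
printf "cp437:  %02X\n", ord $dos_byte;      # 82

# What a cp437 console shows when handed the cp1252 byte unchanged.
my $shown = decode('cp437', $win_byte);
printf "displayed as U+%04X\n", ord $shown;  # U+0398, Greek capital theta
```

The same byte value names two different characters, which is why text saved under one code page looks wrong when the console interprets it under the other.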

So, I use the commands

    chcp 1252
    mode con cp select=1252

at the prompt, expecting it will now match the rest of the machine.

Nope! The results of programs writing to standard output, or of the type command, etc. are still showing the DOS characters.

Anybody know the right incantation to make the Windows Console use the desired code page in this manner?

Better yet, is there a way to make Perl's output use the full Unicode capability of the Console? I'm thinking that a tie'd handle could feed to the Win32 API functions, skipping the OS's file stream.

Best of all, is there a way to make the OS's stream (that's attached to the Console) use UTF-8? That would work with any program, not just Perl.

—John

Replies are listed 'Best First'.
Re: Character Encoding and Windows Console woes
by BrowserUk (Patriarch) on Feb 16, 2004 at 21:29 UTC

    Have you tried perl -C ...? (Did it help?)

    You can enable automatic UTF-8-ification of your standard file handles, default open() layer, and @ARGV by using either the -C command line switch or the PERL_UNICODE environment variable, see the perlrun manpage for the documentation of the -C switch.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!
      perl -C -Mutf8 -e"print qq(\x{83})" >d.txt
      The output file only contains one byte, the 0x83. In UTF-8 it should have been 2 bytes. Printing to the console (not redirecting output), it showed one character in the OEM character set.
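For comparison, here is a minimal sketch of the bytes a UTF-8 encoder would be expected to produce for this character, using the core Encode module. The helper name utf8_hex is hypothetical and not part of any library.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of a string as hex byte pairs.
sub utf8_hex {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print utf8_hex("\x{83}"), "\n";     # C2 83
print utf8_hex("\x{263A}"), "\n";   # E2 98 BA
```

A d.txt holding the single byte 0x83 therefore means no UTF-8 layer was applied to STDOUT in that run.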

      The docs I have say that -C enables wide system calls (See ${^WIDE_SYSTEM_CALLS} in the perlvar manpage.)

        This changed in 5.8.1 (from perlrun)

        -C <number/list>

        The -C flag controls some Unicode of the Perl Unicode features. <<Their typo not mine>>

        ~~snip~~

        (In Perls earlier than 5.8.1 the -C switch was a Win32-only switch that enabled the use of Unicode-aware ``wide system call'' Win32 APIs. This feature was practically unused, however, and the command line switch was therefore ``recycled''.)


        In perlunicode:
        Unicode characters can also be added to a string by using the \x{...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is \x{263A}. This encoding scheme only works for characters with a code of 0x100 or above.
        Something for backward compatibility, I think.
Re: Character Encoding and Windows Console woes
by John M. Dlugosz (Monsignor) on Feb 19, 2004 at 05:40 UTC
    I learned how to get the Console to work properly with Perl. I'm posting it here in the hopes of helping someone else some day.

    First, changing code pages only works properly if you change the Console window to use a TrueType font. Leaving it at "raster fonts" will give strange and even inconsistent results.

    The magic incantation is to change to code page 65001. This makes the console operate in UTF-8. Perl programs will now print output correctly.
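A minimal sketch of a script that relies on this, assuming the console has already been switched with chcp 65001 and uses a TrueType font. The explicit binmode layer is an addition not mentioned in the post. It makes Perl encode wide characters as UTF-8 itself, rather than depending on -C or PERL_UNICODE.

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8, matching code page 65001.
binmode(STDOUT, ':encoding(UTF-8)') or die "binmode failed: $!";

# U+263A is the smiley face used as an example in perlunicode.
my $smiley = "\x{263A}";
print "Smiley: $smiley\n";
```

With the layer in place, a character above 255 should no longer trigger the "Wide character" warning described in the original question.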

      You have helped me greatly, Sir, and I thank you. This is a precious tip. Character set issues in programs are complicated enough by themselves, without adding an additional layer of obfuscation due to the character set and font of the terminal one is using to debug them.