in reply to Re^2: Character encoding in console in Windows
in thread Character encoding in console in Windows

>perl -MDevel::Peek -e"chomp($_=<STDIN>); Dump($_); open($fh, '<', $_) or die; print <$fh>"
C:\Users\ikegami\í.txt
SV = PV(0xeaf80) at 0x2f59c8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x2f3788 "C:\\Users\\ikegami\\\241.txt"\0
  CUR = 22
  LEN = 80
Died at -e line 1, <STDIN> line 1.

Is that encoded for my ANSI code page (1252) or my OEM code page (437)?

>perl -MDevel::Peek -MEncode -e"Dump(encode('cp1252', chr(0xED)))
SV = PV(0x2fa570) at 0x27b210
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x32c24f8 "\355"\0
  CUR = 1
  LEN = 8

>perl -MDevel::Peek -MEncode -e"Dump(encode('cp437', chr(0xED)))
SV = PV(0x30a570) at 0x28b210
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x32c24f8 "\241"\0
  CUR = 1
  LEN = 8

So open() expects the name to be encoded using the ANSI code page, but it's coming from STDIN in the OEM code page.

>perl -MDevel::Peek -MEncode=from_to -e"chomp($_=<STDIN>); from_to($_, 'cp437', 'cp1252'); Dump($_); open($fh, '<', $_) or die; print <$fh>"
C:\Users\ikegami\í.txt
SV = PV(0x159adc0) at 0x315b58
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x344d258 "C:\\Users\\ikegami\\\355.txt"\0
  CUR = 22
  LEN = 80
ok

In broad strokes, the OEM code page is the encoding used by console apps, ANSI for others. Now how do you get those code pages? Good question.

Re^4: Character encoding in console in Windows
by BrowserUk (Patriarch) on Sep 16, 2010 at 07:15 UTC
    Now how do you get those code pages?

    using chcp?

    c:\test>chcp && perl -E" say `chcp`; say `chcp 437`; say`chcp`" && chcp
    Active code page: 850
    Active code page: 850
    Active code page: 437
    Active code page: 437
    Active code page: 437

    Alternatively, Win32::Console:

    InputCP [codepage]
        Gets or sets the input code page used by the console. Note that this doesn't apply to a console object, but to the standard input console. This attribute is used by the Write method. See also: OutputCP.

        Example:

            $codepage = $CONSOLE->InputCP();
            $CONSOLE->InputCP(437);

            # you may want to use the non-instanciated form to avoid confuzion :)
            $codepage = Win32::Console::InputCP();
            Win32::Console::InputCP(437);

      What about the other? I knew how to get that one, but I didn't have time then to find how to get the ANSI code page.

      Could always use Win32::API, of course. The functions are GetACP and GetOEMCP.
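A minimal sketch of that idea, assuming the Win32::API module is installed. The helper names `get_code_pages` and `encoding_name` are made up for illustration, and the "cp" + number naming is an assumption that holds for code pages Encode knows (such as cp437 and cp1252) but not necessarily for every code page Windows can report:

```perl
use strict;
use warnings;

# Return (ANSI, OEM) code page numbers on Windows, or an empty list elsewhere.
# GetACP and GetOEMCP are kernel32 functions that take no arguments and
# return an unsigned integer.
sub get_code_pages {
    return unless $^O eq 'MSWin32';
    require Win32::API;
    my $get_acp = Win32::API->new('kernel32', 'GetACP', '', 'N')
        or die "Can't import GetACP: $^E";
    my $get_oemcp = Win32::API->new('kernel32', 'GetOEMCP', '', 'N')
        or die "Can't import GetOEMCP: $^E";
    return ($get_acp->Call(), $get_oemcp->Call());
}

# Map a code page number to the name Encode uses, e.g. 1252 -> "cp1252".
# Hypothetical helper: assumes Encode ships a "cpNNN" table for that number.
sub encoding_name {
    my ($code_page) = @_;
    return "cp$code_page";
}

# Prints nothing on non-Windows systems, where the list is empty.
if (my ($acp, $oemcp) = get_code_pages()) {
    printf "ANSI: %s, OEM: %s\n", encoding_name($acp), encoding_name($oemcp);
}
```

Note that GetOEMCP reports the system default OEM code page; if someone ran chcp in the console, Win32::Console::InputCP() is the value that reflects it.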

      From my brief research ( microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true ), setting the console encoding with system("chcp XXX") seems like a bad idea. It doesn't even seem possible to set it to UTF-8, and setting it to anything else would probably make it impossible to use the very characters that are causing the problems in the first place, because no single OEM code page covers all the characters people may use. I'd also be reluctant to mess with the settings of other people's computers (does this setting affect only the current session, or is it permanent?).

      So reading the console's encoding (which depends on OS localization) and then converting the incoming text in Perl accordingly sounds better to me... but I'm just taking stabs in the dark. I can't follow half of the posts here, but I can't see working code in any of them so far.

      BTW as I said before, this is just one half of the issue.
      Even if I were to go
      #!/usr/bin/perl
      use strict;
      use warnings;
      use utf8;
      open(FILE, "<:encoding(UTF-8)", "c:\\folder\\í.txt") or print "Oops, can't open file: $!";
      <STDIN>;
      ..and save this in UTF-8, it would still fail to open the file. It seems pretty clear that I'd need to use one of the modules to ever be able to open a file with a non-ASCII name, and I can't really make sense of the documentation of the modules.

      So the step-by-step seems to be:
      1) read what the console's OEM encoding is
      2) convert filepath received via STDIN from OEM to UTF-8
      3) open the file using one of the Unicode modules

      ...or maybe I'm completely wrong.
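One possible variant of those steps, following the from_to one-liner earlier in the thread: instead of converting to UTF-8 and using a Unicode module, re-encode the OEM bytes to the ANSI code page and pass them to plain open(). This is only a sketch. The code pages are hard-coded to the cp437/cp1252 pair seen above (an assumption about the machine), and `oem_to_ansi` and `open_console_path` are hypothetical helper names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Re-encode a path read from the console (OEM bytes) into ANSI bytes.
# Croaks if a byte is invalid in $oem or a character has no mapping in $ansi.
sub oem_to_ansi {
    my ($bytes, $oem, $ansi) = @_;
    my $chars = decode($oem, $bytes, Encode::FB_CROAK());
    return encode($ansi, $chars, Encode::FB_CROAK());
}

# Open a file whose name arrived as OEM bytes; returns undef on failure.
sub open_console_path {
    my ($oem_bytes, $oem, $ansi) = @_;
    open(my $fh, '<', oem_to_ansi($oem_bytes, $oem, $ansi)) or return;
    return $fh;
}

# "\xA1" is the byte the console delivered for the accented i in the dump above.
my $ansi_path = oem_to_ansi("C:\\Users\\ikegami\\\xA1.txt", 'cp437', 'cp1252');
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $ansi_path;
```

This only works for names whose characters exist in the ANSI code page; anything outside it makes encode() croak, which is the case where a wide-character API is needed.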

        it would still fail to open the file.

        Correct. You'd need to encode using the current ANSI code page (Windows) or current locale encoding (unix).

        In Windows, you could also use the system's wide character interface (CreateFileW) via some means other than open. Win32API::File's CreateFileW function would be one such means.
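A rough, untested sketch of the CreateFileW route, assuming Win32API::File is installed. The helper names `to_wide` and `open_wide` are hypothetical, and the three constants are written out with their Win32 SDK values rather than imported, so the file still compiles on systems without the module:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Symbol qw(gensym);

# Win32 SDK values; Win32API::File exports constants with the same names.
use constant {
    GENERIC_READ    => 0x80000000,
    FILE_SHARE_READ => 0x00000001,
    OPEN_EXISTING   => 3,
};

# CreateFileW expects the path as a NUL-terminated UTF-16LE string.
sub to_wide {
    my ($chars) = @_;
    return encode('UTF-16LE', "$chars\0");
}

# Open a file for reading by its (decoded, character-string) name.
# Windows only: the module is loaded at run time, inside the sub.
sub open_wide {
    my ($path) = @_;
    require Win32API::File;
    my $handle = Win32API::File::CreateFileW(
        to_wide($path),
        GENERIC_READ,
        FILE_SHARE_READ,
        [],              # default security attributes
        OPEN_EXISTING,
        0,               # no special flags
        [],              # no template handle
    ) or die "CreateFileW failed: $^E";

    # Wrap the native handle in an ordinary Perl file handle.
    my $fh = gensym();
    Win32API::File::OsFHandleOpen($fh, $handle, 'r')
        or die "OsFHandleOpen failed: $^E";
    return $fh;
}
```

On Windows, `my $fh = open_wide("C:\\Users\\ikegami\\\x{ED}.txt"); print <$fh>;` would be the intended usage. The path passed in must already be decoded to characters (for example with decode('cp437', ...) on console input), not raw OEM bytes.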