in reply to [OT] ASCII, cmd.exe, linux console, charset, code pages, fonts and other amenities
So which characters I can expect to be printed equal on different platform? Only 32..127 ? what about 128..255 ones?
You need to provide what the terminal expects. You can probably rely on the character set being based on ASCII, so you should be able to print ASCII's basic whitespace characters (9, 10, 13, 32) and its non-whitespace printable characters (33..126) without problems. (127 is a control character.)
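As a rough illustration of that safe range, here is a minimal sketch that checks whether a string sticks to those characters. The name is_portable_ascii is made up for this example; it is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of $s is tab (9), LF (10),
# CR (13), or in the printable ASCII range 32..126.
sub is_portable_ascii {
    my ($s) = @_;
    return $s =~ /\A[\x09\x0A\x0D\x20-\x7E]*\z/ ? 1 : 0;
}

print is_portable_ascii("Hello, world!\n") ? "ok\n" : "not portable\n";
print is_portable_ascii("caf\x{E9}")       ? "ok\n" : "not portable\n";
```

The first call prints "ok"; the second prints "not portable" because of the character 233 (é).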
If you want to print other characters, you will need to correctly encode your output.
If you expect to receive other characters, you will need to correctly decode your input.
You can do this using the following:
    BEGIN {
        if ($^O eq 'MSWin32') {   # $^O is 'MSWin32' on Windows, not 'Win32'
            require Win32;
            my $cie = "cp" . Win32::GetConsoleCP();        # console input code page
            my $coe = "cp" . Win32::GetConsoleOutputCP();  # console output code page
            my $ae  = "cp" . Win32::GetACP();              # "ANSI" code page

            binmode(STDIN,  ":encoding($cie)");
            binmode(STDOUT, ":encoding($coe)");
            binmode(STDERR, ":encoding($coe)");

            # Default encoding for files opened by Perl.
            require open;
            "open"->import(":encoding($ae)");

            # Arguments arrive encoded using the ANSI code page.
            require Encode;
            @ARGV = map { Encode::decode($ae, $_) } @ARGV;
        }
        else {
            require encoding;
            my $e = encoding::_get_locale_encoding() // 'UTF-8';

            # Same encoding for STDIN/STDOUT/STDERR and for opened files.
            require open;
            "open"->import(':std', ":encoding($e)");

            require Encode;
            @ARGV = map { Encode::decode($e, $_) } @ARGV;
        }
    }
Note: While UTF-8 is probably the only encoding you need to deal with on modern Unix systems, you have to deal with four different encodings on Windows. System calls are made using one's choice of the system's "ANSI" interface (e.g. cp1252) or the "Wide" (UTF-16LE) interface. (Perl only uses the ANSI interface, though modules can still use either/both.) The ANSI code page is hardcoded for your version of Windows. The console uses a configurable encoding known as the OEM code page (e.g. cp437, cp850). The default OEM code page is based on your language settings. (For some reason, a console's input and output encodings can be different, but I have no idea how/why that would happen.) Finally, lots of data encountered is encoded using UTF-8. This brings up two "unanswerable" questions:
This means that printing to a file opened by Perl and printing to STDOUT redirected to a file will produce files encoded using different encodings, but it also means that foo | find "bar" will produce readable output.
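To make the difference concrete, here is a small sketch that encodes the same character with a few of the encodings mentioned in this post and dumps the resulting bytes. The choice of é and of these particular code pages is only for illustration; your machine's ANSI and OEM code pages may differ.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $char = "\x{E9}";   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Example encodings only: a typical ANSI code page, two OEM code pages, UTF-8.
for my $enc (qw( cp1252 cp437 cp850 UTF-8 )) {
    my $bytes = encode($enc, $char);
    my $hex   = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
    printf "%-6s %s\n", $enc, $hex;
}
```

With these encodings, é comes out as E9 in cp1252, as 82 in cp437 and cp850, and as C3 A9 in UTF-8, so a file written through the ANSI layer and one captured from console output would not be byte-for-byte the same.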
Note: On Windows, the arguments are always provided encoded using the system's Active (aka "ANSI") code page (e.g. 1252), not the console's (aka "OEM") code page (e.g. 437, 850, 65001), so only characters that exist in both the ANSI and OEM code pages can be passed via arguments. So even if the console is using UTF-8, arguments are limited to the machine's ANSI character set. This limit can be worked around by obtaining the command line using GetCommandLineW and re-parsing it.
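If you want a quick way to see which characters fall into that overlap, something like the following might do. It is only a sketch: encodable_in and survives_argv are names invented for this example, and cp1252/cp437 are stand-ins for whatever ANSI and OEM code pages your system actually uses.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: true if every character of $s exists in $enc.
# FB_CROAK makes encode() die on an unmappable character; LEAVE_SRC
# keeps encode() from modifying $s.
sub encodable_in {
    my ($enc, $s) = @_;
    return eval { encode($enc, $s, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: true if $s exists in both the ANSI and OEM code pages.
sub survives_argv {
    my ($s, $ansi, $oem) = @_;
    return encodable_in($ansi, $s) && encodable_in($oem, $s) ? 1 : 0;
}

for my $s ("\x{E9}", "\x{20AC}", "\x{3B1}") {   # e-acute, euro sign, Greek alpha
    printf "U+%04X %s\n", ord($s),
        survives_argv($s, 'cp1252', 'cp437') ? "in both" : "not in both";
}
```

Here é is in both code pages, the euro sign is only in cp1252, and α is only in cp437, so only é is reported as "in both".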