Discipulus has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm developing a little game intended to be run in a console. For this reason I need to be sure about which characters I can print to the console.

1) Can I assume chars 0..127 are the same everywhere? I fear not.. at least there is some difference between Linux and Windows:

    perl -e "print chr($_) for 0..127"   # cmd.exe
    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

    perl -e 'print chr($_) for 0..127'   # linux
    123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

update: I forgot the tilde in the linux output; thanks choroba who spotted it

On some Windows versions the first chars (the control ones, 0..31) are displayed as funny elements: smiling faces, playing card suits.. I suppose this is not portable at all.

2) What about extended ASCII?

    perl -e "print chr($_) for 128..255"   # cmd.exe
    perl -e 'print chr($_) for 128..255'   # linux
I see a lot of fancy characters printed in cmd.exe and only garbage in the Linux console. Here the differences are bigger.. I suspect I cannot trust these chars to be the same everywhere.

3) cmd.exe has the notion of a code page (see chcp, where the very limited support for Unicode is stated). Linux has the locale, in which the character set is specified.

4) So which characters can I expect to be printed identically on different platforms? Only 32..127? What about the 128..255 ones?

5) M for mountains and m for hills are a bit ugly to see, so I played a bit with the bitfontmaker2 website, which permits the easy creation of a custom font. It offers to modify/create 316 (?!) different chars... how are these mapped to our Perl 0..255 ones? I know that cmd.exe has only limited support for monospaced TTF fonts: I have tried a custom Lucida Console font with success (horrible atm, available here). Are these fonts copyrighted? In case I can create a custom, usable font, how can I map to these custom chars? Via chr(xx)?

Thanks for any clarification you will be able to provide.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re: [OT] ASCII, cmd.exe, linux console, charset, code pages, fonts and other amenities
by haukex (Archbishop) on Mar 29, 2019 at 12:53 UTC

    To play it safe, only print characters in the range 0x20 to 0x7E (inclusive), plus of course \n and \t. What other characters would you want to print?
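    A quick sketch of such a filter, in case it is useful (the substitution and the '?' fallback character are my own choice, not something haukex prescribed):

```perl
# Replace anything outside printable ASCII (plus \n and \t) with '?',
# so the output is safe on any ASCII-based console.
sub ascii_safe {
    my ($s) = @_;
    $s =~ s/[^\x20-\x7E\n\t]/?/g;
    return $s;
}

print ascii_safe("Hills: m, Mountains: M, Snowman: \x{2603}\n");
# the snowman falls outside 0x20..0x7E and comes out as '?'
```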

    On Linux, I think you can use ${^UTF8LOCALE} to detect if there is UTF-8 support present, but I'm not an expert on this, and there are probably some caveats. You might just want to ask the user what encoding to use on output, and use e.g. use open qw/:std :encoding(UTF-8)/;
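    A hedged sketch of what that detection might look like at runtime (using binmode instead of the compile-time open pragma, and falling back to plain ASCII box parts when no UTF-8 locale was detected):

```perl
use strict;
use warnings;

# ${^UTF8LOCALE} is true when perl detected a UTF-8 locale at startup
# (available since perl 5.8.8); as noted above, there may be caveats.
if (${^UTF8LOCALE}) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "\x{2502} \x{2500} \x{253C}\n";   # box-drawing characters
}
else {
    print "| - +\n";                        # plain ASCII fallback
}
```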

    And on Windows... good luck ;-) I think Win32::GetConsoleOutputCP() can tell you the current console's encoding.

    Unfortunately I have no experience on custom fonts.

Re: [OT] ASCII, cmd.exe, linux console, charset, code pages, fonts and other amenities
by ikegami (Patriarch) on Mar 30, 2019 at 04:14 UTC

    So which characters I can expect to be printed equal on different platform? Only 32..127 ? what about 128..255 ones?

    You need to provide what the terminal expects. You can probably rely on the character set being based on ASCII, so you should be able to print ASCII's basic whitespace characters (9, 10, 13, 32) and its non-whitespace printable characters (33..126) without problem. (127 is a control character.)

    If you want to print other characters, you will need to correctly encode your output.

    If you expect to receive other characters, you will need to correctly decode your input.

    You can do this using the following:

    BEGIN {
        if ($^O eq 'MSWin32') {
            require Win32;
            my $cie = "cp" . Win32::GetConsoleCP();
            my $coe = "cp" . Win32::GetConsoleOutputCP();
            my $ae  = "cp" . Win32::GetACP();
            binmode(STDIN,  ":encoding($cie)");
            binmode(STDOUT, ":encoding($coe)");
            binmode(STDERR, ":encoding($coe)");
            require open;
            "open"->import(":encoding($ae)");
            require Encode;
            @ARGV = map { Encode::decode($ae, $_) } @ARGV;
        }
        else {
            require encoding;
            my $e = encoding::_get_locale_encoding() // 'UTF-8';
            require open;
            "open"->import(':std', ":encoding($e)");
            require Encode;
            @ARGV = map { Encode::decode($e, $_) } @ARGV;
        }
    }

    Note: While UTF-8 is probably the only encoding you need to deal with on modern unix systems, you have to deal with 4 different encodings on Windows. System calls are made using one's choice of the system's "ANSI" interface (e.g. cp1252) or using the "Wide" (UTF-16le) interface. (Perl only uses the ANSI interface, though modules can still use either/both.) The ANSI code page is hardcoded for your version of Windows. The console uses a configurable encoding known as the OEM code page (e.g. cp437, cp850). The default OEM code page is based on your language settings. (For some reason, a console's input and output encoding can be different, but I have no idea how/why that would happen.) Finally, lots of data encountered is encoded using UTF-8. This brings up two "unanswerable" questions:

    • Which encoding should be used for a file by default? The above assumes ANSI CP.
    • Which encoding should be used for STDIN, STDOUT and STDERR when they're not connected to a terminal? The above assumes the OEM CP.

    This means that printing to a file opened by Perl and printing to STDOUT redirected to a file will produce files with different encodings, but it means that foo | find "bar" will produce readable output.

    Note: On Windows, the arguments are always provided encoded using the system's Active (aka "ANSI") code page (e.g. 1252), not the console's (aka "OEM") code page (e.g. 437, 850, 65001), so only characters that exist both in the ANSI and OEM code page can be passed via arguments. So even if the console is using UTF-8, arguments are limited to using the machine's ANSI character set. This limit can be worked around by obtaining the command line using GetCommandLineW and re-parsing it.

Re: [OT] ASCII, cmd.exe, linux console, charset, code pages, fonts and other amenities
by Corion (Patriarch) on Mar 29, 2019 at 13:34 UTC

    On Windows (7 and 10 at least), you can run

    chcp 65001

    to make the console (and thus STDOUT) understand UTF-8. Then, many more characters become supported.
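    Note that switching the code page alone is not enough: Perl still has to encode its output as UTF-8, or you get "Wide character in print" warnings and mangled bytes. A minimal sketch (the binmode call is the key part):

```perl
# Assumes the console was already switched with `chcp 65001`.
binmode STDOUT, ':encoding(UTF-8)';
print chr($_) for 0x20 .. 0x7E;              # printable ASCII, as before
print "\n", chr(0x2665), chr(0x2660), "\n";  # heart and spade suits
```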

      thanks Corion, I read about this already but do not really understand what it means..

      chcp
      Active code page: 65001

      perl -e "print chr($_) for 1..256"
      Wide character in print at -e line 1.
      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ���������� .... more of these..

      Even with this codepage and in win7 I get the dreadful wide character in print message..

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
        You need to tell Perl your terminal now works in UTF-8.
        perl -CO -we "print chr for 1 .. 255"
        And you can try higher numbers than 255, too.

        Update: Missing -e, thanks Your Mother.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]