syphilis has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

A C library that I'm accessing from perl (via XS) may, under some conditions, attempt to print ± or ∞.
The library code attempts to do this by passing the infinity or the plus-or-minus as a string literal to fprintf().
If the filehandle is stdout, then these symbols are being displayed by my Windows perls as garbage.
No such problem on Linux ... and no such problem with perls running under Cygwin on the very same Windows 11 box. All works there as intended.
In fact, it's also fine on the Windows system if the output filehandle points to a file, or if I redirect the output to a file (rather than allowing it to go to the cmd.exe console).

By way of demonstration, I have the following Inline::C script
use strict;
use warnings;
use utf8;

use Inline C => <<'EOC';

void wprint () {
  printf("+ or - : ±\n");
}

void foo () {
  unsigned char a = '±';
  printf("%d\n", a);
}

EOC

wprint();
foo();

For me, that outputs:

+ or - : ▒
177

The question is:
Leaving the C code as it is, how do I get the plus-or-minus symbol to display as intended on Windows (cp437 or cp850) when the display is being sent to stdout ?
I thought this would be easy ... but not for me. Even using Text::Iconv (which I'm sure that I once understood) is doing nothing. So, AFAIK, it might not even be possible ?? (Is it something that can only be handled within the C code ?)

And using stuff that I don't understand (like use open ':std', ':encoding(cp437)') is also having no effect, irrespective of whether I specify cp437, cp850 or cp1252.

According to chcp, my Windows 11 codepage is cp437. On my Windows 7, where the same issue arises, the codepage is cp850.

Any help is much appreciated.

Cheers,
Rob

Re: Encoding problem with function in C library
by haukex (Archbishop) on Dec 22, 2022 at 10:27 UTC

    Could you clarify a few things?

    it's also fine on the Windows system if the output filehandle is to a file, or if I redirect the output to a file (rather than allowing it to go to the cmd.exe console)

    In both of those cases, what encoding is the resulting file in? (see for example my tool enctool)

    use utf8; ... unsigned char a = '±';

    That looks wrong to me because a char is one byte and you've given it two (C2 B1), so I don't know if your C example is representative, and gcc does complain warning: multi-character character constant [-Wmultichar].

    In pure Perl, with a combination of chcp 65001 and use open qw/:std :encoding(UTF-8)/;, I can get it to output correctly, but I guess the question is what encoding the library is using (hence my question above).

    C:\Temp>type test.pl
    #!perl
    use warnings;
    use strict;
    use open qw/:std :encoding(UTF-8)/;
    
    print "\N{U+B1}\N{U+221E}\n";
    
    C:\Temp>chcp
    Active Codepage: 850.
    
    C:\Temp>perl test.pl
    ┬▒Ôê×
    
    C:\Temp>chcp 65001
    Active Codepage: 65001.
    
    C:\Temp>perl test.pl
    ±∞
    
      That looks wrong to me because a char is one byte and you've given it two (C2 B1),...

      Yeah - when I do a noisy build (and thereby have any warnings displayed) I see the same warning.
      I also see warning: unsigned conversion from 'int' to 'unsigned char' changes value from '49841' to '177' [-Woverflow].
      Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

      In pure Perl, with a combination of chcp 65001 and use open qw/:std :encoding(UTF-8)/;, I can get it to output correctly

      Thanks for that - I see exactly the same behaviours with your test.pl as you did.

      The mention of codepage 65001 looks to have been the godsend I was looking for.
      Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

      Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?
      I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run.
      I guess I should also check to see if 65001 ought only be set when Windows prints to stdout.
      Does that sound reasonable ? (I will check this on a range of perls on Windows 7 and Windows 11 anyway.)
      Update: Hmmm, a quick check on Windows 7 reveals that changing the codepage is apparently having no effect. Bummer !! (But it's getting late over here, and will have to wait until I've had some sleep, before checking further.)

      Thanks ever so much for the responses. Things are now a little clearer.

      Cheers,
      Rob

        Win32::Console gives you the interface to ->OutputCP:

        #!perl
        use strict;
        use warnings;
        use charnames ':full';
        use Win32::Console;

        my $c = Win32::Console->new(STD_OUTPUT_HANDLE);
        $c->OutputCP(65001); # we write UTF-8
        binmode STDOUT, ':encoding(UTF-8)';
        print "\N{INFINITY}\n";

        In my tests, the output code page persisted after the program run, so you might (or might not) want to save/restore the codepage:

        my $oldCP = $c->OutputCP();
        $c->OutputCP(65001); # we write UTF-8
        END {
            if( $c ) {
                $c->OutputCP($oldCP); # restore the original code page
            }
        }
        ...

        Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

        In this case it just so happens that the UTF-8 encoding of U+00B1 PLUS-MINUS SIGN is C2 B1 and I guess that the C compiler is doing the equivalent of unsigned char a = 0xC2B1 & 0xFF. It also just so happens that 0xB1 (177) is the character ± in CP1252, Latin-1, and others (which I guess is Notepad's interpretation), but in CP850 and CP437, 0xB1 is ▒. You'll probably not see this happening with ∞ U+221E INFINITY, whose UTF-8 encoding is E2 88 9E, but which is 0xEC in CP437, and which has no representation in the other three encodings I mentioned. Oh, the joys of single-byte encodings :-)

        Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

        It would seem logical to me then that the library is outputting UTF-8.

        Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?

        I should note I'm not an expert on this topic - but this works for me:

        use warnings;
        use strict;
        use open qw/:std :encoding(UTF-8)/;
        use Win32;
        Win32::SetConsoleOutputCP(65001);
        print "\N{U+B1}\N{U+221E}\n";

        I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run.

        CP65001 is UTF-8, and IMHO UTF-8 is probably the most universal, so unless you've got some other funky Unicode stuff going on, I don't think you'd need to change it back, the boilerplate I showed above should be fine for the entire process - that and, according to the sources I found, making sure that your terminal is using a Unicode-capable font.

        Win32::Console::OutputCP( 65001 );
Re: Encoding problem with function in C library
by Anonymous Monk on Dec 22, 2022 at 05:56 UTC
    Hi Rob,

    IMHO, this may not be a Perl issue. If you want a character to display correctly in a Windows cmd window, three things need to hold:

    1. The program outputs the character's bytes in the right code page (e.g. cp850).
    2. The code page in CMD matches it (cp850).
    3. The font that CMD uses can display the character.
    With all three in place I can show any character. You should give it a try. ;)

      Sorry, I forgot to log in. Besides, you can right-click on the cmd title bar and choose Properties; there you can see which font and code page are in use.




      I am trying to improve my English skills, if you see a mistake please feel free to reply or /msg me a correction

Re: Encoding problem with function in C library
by Anonymous Monk on Dec 22, 2022 at 07:06 UTC
    chcp 1252

    The number 177 (0xB1) is hard-coded in your dll, but in CP850 and CP437 that byte is the "medium shade" block character (▒). The terminal uses those code pages so that (supposedly still used) DOS programs display their output as expected.

Re: Unicode on windows cmd not displaying properly
by Anonymous Monk on Dec 22, 2022 at 13:00 UTC