in reply to Encoding problem with function in C library

Could you clarify a few things?

it's also fine on the Windows system if the output filehandle is to a file, or if I redirect the output to a file (rather than allowing it to go to the cmd.exe console)

In both of those cases, what encoding is the resulting file in? (see for example my tool enctool)

use utf8; ... unsigned char a = '±';

That looks wrong to me because a char is one byte and you've given it two (C2 B1), so I don't know if your C example is representative. gcc also complains about it: "warning: multi-character character constant [-Wmultichar]".

In pure Perl, with a combination of chcp 65001 and use open qw/:std :encoding(UTF-8)/;, I can get it to output correctly, but I guess the question is what encoding the library is using (hence my question above).

C:\Temp>type test.pl
#!perl
use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;

print "\N{U+B1}\N{U+221E}\n";

C:\Temp>chcp
Active Codepage: 850.

C:\Temp>perl test.pl
┬▒Ôê×

C:\Temp>chcp 65001
Active Codepage: 65001.

C:\Temp>perl test.pl
±∞
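
As an illustration of what I think is happening in the first run: Perl writes the UTF-8 bytes C2 B1 E2 88 9E, and a console set to code page 850 shows each byte as a separate CP850 character. Below is a minimal sketch that reproduces this on any platform with the core Encode module. The helper name as_seen_in_codepage is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode $text as UTF-8, then decode those bytes
# the way a console using $codepage would interpret them.
sub as_seen_in_codepage {
    my ($text, $codepage) = @_;
    my $bytes = encode('UTF-8', $text);
    return decode($codepage, $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';

my $text = "\N{U+B1}\N{U+221E}";

# The UTF-8 bytes that actually get sent to the console
print join(' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $text)), "\n";

# The same bytes as a CP850 console would render them
print as_seen_in_codepage($text, 'cp850'), "\n";
```

On a UTF-8 terminal the second line should match the garbled output from the chcp 850 run above.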

Re^2: Encoding problem with function in C library
by syphilis (Archbishop) on Dec 22, 2022 at 13:06 UTC
    That looks wrong to me because a char is one byte and you've given it two (C2 B1),...

    Yeah - when I do a noisy build (and thereby have any warnings displayed) I see the same warning.
    I also see warning: unsigned conversion from 'int' to 'unsigned char' changes value from '49841' to '177' [-Woverflow].
    Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

    In pure Perl, with a combination of chcp 65001 and use open qw/:std :encoding(UTF-8)/;, I can get it to output correctly

    Thanks for that - I see exactly the same behaviours with your test.pl as you did.

    The mention of codepage 65001 looks to have been the godsend I was looking for.
    Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

    Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?
    I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run.
    I guess I should also check to see if 65001 ought only be set when Windows prints to stdout.
    Does that sound reasonable ? (I will check this on a range of perls on Windows 7 and Windows 11 anyway.)
    Update: Hmmm, a quick check on Windows 7 reveals that changing the codepage is apparently having no effect. Bummer !! (But it's getting late over here, and will have to wait until I've had some sleep, before checking further.)

    Thanks ever so much for the responses. Things are now a little clearer.

    Cheers,
    Rob

      Win32::Console gives you the interface to ->OutputCP:

      #!perl
      use strict;
      use warnings;
      use charnames ':full';
      use Win32::Console;

      my $c = Win32::Console->new(STD_OUTPUT_HANDLE);
      $c->OutputCP(65001); # we write UTF-8
      binmode STDOUT, ':encoding(UTF-8)';
      print "\N{INFINITY}\n";

      In my tests, the output code page persisted after the program run, so you might (or might not) want to save/restore the codepage:

      my $oldCP = $c->OutputCP();
      $c->OutputCP(65001); # we write UTF-8
      END {
          if( $c ) {
              $c->OutputCP($oldCP); # restore the original code page
          }
      }
      ...

        In my tests, the output code page persisted after the program run

        Interesting, in my test on Win 10 Pro with Win32::SetConsoleOutputCP(65001), the codepage change didn't persist.

        Update: Sorry, never mind, see my reply below!

      Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

      In this case it just so happens that the UTF-8 encoding of U+00B1 PLUS-MINUS SIGN is C2 B1 and I guess that the C compiler is doing the equivalent of unsigned char a = 0xC2B1 & 0xFF. It also just so happens that 0xB1 (177) is the character ± in CP1252, Latin-1, and others (which I guess is Notepad's interpretation), but in CP850 and CP437, 0xB1 is ▒. You'll probably not see this happening with ∞ U+221E INFINITY, whose UTF-8 encoding is E2 88 9E, but which is 0xEC in CP437, and which has no representation in the other three encodings I mentioned. Oh, the joys of single-byte encodings :-)
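
To make that concrete, here is a small sketch (again using only the core Encode module) that mimics the truncation I'm guessing the compiler performs and then shows how the resulting byte 0xB1 is interpreted under each of the encodings mentioned. The helper name byte_in_codepage is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: the character that a single byte value
# represents in the given single-byte encoding.
sub byte_in_codepage {
    my ($byte, $codepage) = @_;
    return decode($codepage, pack('C', $byte));
}

binmode STDOUT, ':encoding(UTF-8)';

# UTF-8 bytes of U+00B1 read as one 16-bit value, like a multi-character constant
my $multichar = unpack('n', encode('UTF-8', "\N{U+B1}"));   # 0xC2B1
# Keeping only the low byte is what the conversion to unsigned char amounts to
my $truncated = $multichar & 0xFF;                          # 0xB1
printf "%d (0x%04X) -> %d (0x%02X)\n", $multichar, $multichar, $truncated, $truncated;

for my $cp (qw(cp1252 iso-8859-1 cp850 cp437)) {
    my $char = byte_in_codepage($truncated, $cp);
    printf "%-10s 0x%02X is U+%04X %s\n", $cp, $truncated, ord($char), $char;
}
```

The first line of output shows the same 49841 and 177 that appear in the -Woverflow warning quoted above.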

      Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

      It would seem logical to me then that the library is outputting UTF-8.

      Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?

      I should note I'm not an expert on this topic - but this works for me:

      use warnings;
      use strict;
      use open qw/:std :encoding(UTF-8)/;
      use Win32;
      Win32::SetConsoleOutputCP(65001);
      print "\N{U+B1}\N{U+221E}\n";

      I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run.

      CP65001 is UTF-8, and IMHO UTF-8 is probably the most universal, so unless you've got some other funky Unicode stuff going on, I don't think you'd need to change it back, the boilerplate I showed above should be fine for the entire process - that and, according to the sources I found, making sure that your terminal is using a Unicode-capable font.
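
If you do decide to switch the code page only around the one call and restore it afterwards, something along these lines might work. This is an untested sketch, not what Math::MPC does: the wrapper name with_utf8_console is made up, and it relies on Win32::GetConsoleOutputCP and Win32::SetConsoleOutputCP from the Win32 module. On other platforms it just runs the code.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical wrapper: run $code with the console output code page set to
# 65001 (UTF-8) on Windows, then put the previous code page back.
sub with_utf8_console {
    my ($code) = @_;
    my $old_cp;
    if ($^O eq 'MSWin32') {
        require Win32;
        $old_cp = Win32::GetConsoleOutputCP();
        Win32::SetConsoleOutputCP(65001);
    }
    my @result = eval { $code->() };
    my $err = $@;
    # Flush Perl's buffered output before restoring, so that it is written
    # while the UTF-8 code page is still active.
    STDOUT->flush;
    Win32::SetConsoleOutputCP($old_cp) if $old_cp;
    die $err if $err;
    return @result;
}

binmode STDOUT, ':encoding(UTF-8)';
with_utf8_console(sub { print "\N{U+B1}\N{U+221E}\n" });
```

Note that a C library writing through its own stdio buffer would need its output flushed as well before the code page is restored, and that part is not covered here.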

        use warnings;
        use strict;
        use open qw/:std :encoding(UTF-8)/;
        use Win32;
        Win32::SetConsoleOutputCP(65001);
        print "\N{U+B1}\N{U+221E}\n";

        That works nicely on Windows 10 and 11. But not on Windows 7, where I find that altering the codepage ostensibly succeeds, but in reality has no effect.
        Perhaps the explanation for that might be found in one of AM's links.
        Anyway, I can probably ignore this issue with Windows 7 and earlier. It's unlikely that anyone other than me would ever hit it.

        ... so unless you've got some other funky Unicode stuff going on, I don't think you'd need to change it back

        Yes, I think so. It seems that Win32::SetConsoleOutputCP(65001) sets the codepage for the duration of the program and that should generally be fine, whereas chcp 65001 sets it for the duration of the cmd.exe console (and that's not so acceptable).

        Thanks again for the pointers, guys !!

        Cheers,
        Rob
      Win32::Console::OutputCP( 65001 );