Encoding problem with function in C library

syphilis has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

A C library that I'm accessing from perl (via XS) may, under some conditions, try to print ± or ∞.
The library code attempts to do this by passing the infinity or the plus-or-minus as a string literal to fprintf().
If the filehandle is stdout, then these symbols are being displayed by my Windows perls as garbage.
No such problem on Linux ... and no such problem with perls running under Cygwin on the very same Windows 11 box. All works there as intended.
In fact, it's also fine on the Windows system if the output filehandle is a file, or if I redirect the output to a file (rather than allowing it to go to the cmd.exe console).

By way of demonstration, I have the following Inline::C script:

use strict;
use warnings;
use utf8;

use Inline C => <<'EOC';

void wprint () {
  printf("+ or -  : ±\n");
}

void foo () {
  unsigned char a = '±';
  printf("%d\n", a);
}

EOC

wprint();
foo();

For me, that outputs:

+ or - : ▒
177

The question is:
Leaving the C code as it is, how do I get the plus-or-minus symbol to display as intended on Windows (cp437 or cp850) when the display is being sent to stdout ?
I thought this would be easy ... but not for me. Even using Text::Iconv (which I'm sure that I once understood) is doing nothing. So, for all I know, it might not even be possible ?? (Is it something that can only be handled within the C code ?)

And using stuff that I don't understand (like use open ':std', ':encoding(cp437)') is also having no effect, irrespective of whether I specify cp437, cp850 or cp1252.

According to chcp, my Windows 11 codepage is cp437. On my Windows 7, where the same issue arises, the codepage is cp850.

Any help is much appreciated.

Cheers,
Rob

Re: Encoding problem with function in C library by haukex (Archbishop) on Dec 22, 2022 at 10:27 UTC
Could you clarify a few things?

"it's also fine on the Windows system if the output filehandle is to a file, or if I redirect the output to a file (rather than allowing it to go to the cmd.exe console)"

In both of those cases, what encoding is the resulting file in? (See, for example, my tool enctool.)

`use utf8; ... unsigned char a = '±';`

That looks wrong to me because a `char` is one byte and you've given it two (`C2 B1`), so I don't know if your C example is representative, and `gcc` does complain `warning: multi-character character constant [-Wmultichar]`.

In pure Perl, with a combination of `chcp 65001` and `use open qw/:std :encoding(UTF-8)/;`, I can get it to output correctly, but I guess the question is what encoding the library is using (hence my question above).

C:\Temp>type test.pl
#!perl
use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;
print "\N{U+B1}\N{U+221E}\n";

C:\Temp>chcp
Active Codepage: 850.

C:\Temp>perl test.pl
┬▒Ôê×

C:\Temp>chcp 65001
Active Codepage: 65001.

C:\Temp>perl test.pl
±∞
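The garbage in the first run of test.pl can be reproduced without a Windows console. This is a minimal sketch, assuming only the core Encode module: it encodes the two characters as UTF-8 and then decodes the resulting bytes as cp850, which is roughly what a console on code page 850 does with them. The variable names are illustrative and not taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The two characters printed by test.pl: U+00B1 and U+221E.
my $text = "\x{B1}\x{221E}";

# The bytes Perl writes to STDOUT under :encoding(UTF-8).
my $bytes = encode('UTF-8', $text);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# The characters a console on code page 850 shows for those same bytes.
my $shown = decode('cp850', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "bytes written : $hex\n";     # C2 B1 E2 88 9E
print "seen as cp850 : $shown\n";   # the same five characters as in the transcript
```

The bytes are correct UTF-8 in both runs; only the console's interpretation changes, which is why switching the code page to 65001 is enough.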
Re^2: Encoding problem with function in C library by syphilis (Archbishop) on Dec 22, 2022 at 13:06 UTC
"That looks wrong to me because a char is one byte and you've given it two (C2 B1),..."

Yeah - when I do a noisy build (and thereby have any warnings displayed) I see the same warning. I also see `warning: unsigned conversion from 'int' to 'unsigned char' changes value from '49841' to '177' [-Woverflow]`. Then, when the `chr(177)` gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

"In pure Perl, with a combination of chcp 65001 and use open qw/:std :encoding(UTF-8)/;, I can get it to output correctly"

Thanks for that - I see exactly the same behaviours with your test.pl as you did. The mention of codepage 65001 looks to have been the godsend I was looking for. Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

Is there some way I can manipulate the active code page in perl (on windows) without shelling out to `chcp` ?

I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run. I guess I should also check to see if 65001 ought only be set when Windows prints to stdout. Does that sound reasonable ? (I will check this on a range of perls on Windows 7 and Windows 11 anyway.)

Update: Hmmm, a quick check on Windows 7 reveals that changing the codepage is apparently having no effect. Bummer !! (But it's getting late over here, and will have to wait until I've had some sleep, before checking further.)

Thanks ever so much for the responses. Things are now a little clearer.

Cheers,
Rob
Re^3: Encoding problem with function in C library by Corion (Patriarch) on Dec 22, 2022 at 13:20 UTC
Win32::Console gives you the interface to `->OutputCP`:

#!perl
use strict;
use warnings;
use charnames ':full';
use Win32::Console;

my $c = Win32::Console->new(STD_OUTPUT_HANDLE);
$c->OutputCP(65001); # we write UTF-8
binmode STDOUT, ':encoding(UTF-8)';
print "\N{INFINITY}\n";

In my tests, the output code page persisted after the program run, so you might (or might not) want to save/restore the codepage:

my $oldCP = $c->OutputCP();
$c->OutputCP(65001); # we write UTF-8
END {
    if( $c ) {
        $c->OutputCP($oldCP); # restore the original code page
    }
}
...
Re^4: Encoding problem with function in C library (updated) by haukex (Archbishop) on Dec 22, 2022 at 13:35 UTC
Re^5: Encoding problem with function in C library by Corion (Patriarch) on Dec 22, 2022 at 13:43 UTC
Some notes below your chosen depth have not been shown here
Re^3: Encoding problem with function in C library by haukex (Archbishop) on Dec 22, 2022 at 13:32 UTC
"Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad."

In this case it just so happens that the UTF-8 encoding of `U+00B1 PLUS-MINUS SIGN` is `C2 B1` and I guess that the C compiler is doing the equivalent of `unsigned char a = 0xC2B1 & 0xFF`. It also just so happens that `0xB1` (177) is the character ± in CP1252, Latin-1, and others (which I guess is Notepad's interpretation), but in CP850 and CP437, `0xB1` is ▒. You'll probably not see this happening with ∞ `U+221E INFINITY`, whose UTF-8 encoding is `E2 88 9E`, but which is `0xEC` in CP437, and which has no representation in the other three encodings I mentioned. Oh, the joys of single-byte encodings `:-)`

"Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module."

It would seem logical to me then that the library is outputting UTF-8.

"Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?"

I should note I'm not an expert on this topic - but this works for me:

use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;
use Win32;
Win32::SetConsoleOutputCP(65001);
print "\N{U+B1}\N{U+221E}\n";

"I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run."

CP65001 is UTF-8, and IMHO UTF-8 is probably the most universal, so unless you've got some other funky Unicode stuff going on, I don't think you'd need to change it back, the boilerplate I showed above should be fine for the entire process - that and, according to the sources I found, making sure that your terminal is using a Unicode-capable font.
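The byte arithmetic in this reply can be checked with a short script. This is only a sketch, assuming the core Encode module and its cp437, cp850, cp1252 and iso-8859-1 tables; the masking step imitates what the compiler is guessed to do above, it is not taken from any compiler documentation.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# Keep only the low byte of the two-byte constant, as guessed above.
my $low = 0xC2B1 & 0xFF;
printf "low byte: 0x%02X (%d)\n", $low, $low;

# The same byte is a different character in each single-byte encoding.
for my $cp (qw(cp437 cp850 cp1252 iso-8859-1)) {
    my $char = decode($cp, chr($low));
    printf "%-10s 0x%02X -> U+%04X %s\n", $cp, $low, ord($char), $char;
}

# INFINITY has a byte in cp437 only; for the others encode() falls back
# to its default substitution character, '?' (0x3F).
for my $cp (qw(cp437 cp850 cp1252 iso-8859-1)) {
    my $byte = encode($cp, "\x{221E}");
    printf "%-10s U+221E -> 0x%02X\n", $cp, ord($byte);
}
```

The first loop reports U+2592 for the two DOS code pages and U+00B1 for CP1252 and Latin-1, which matches the console garbage on one side and the Notepad display on the other.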
Re^4: Encoding problem with function in C library by syphilis (Archbishop) on Dec 23, 2022 at 00:09 UTC
Re^5: Encoding problem with function in C library by haukex (Archbishop) on Dec 23, 2022 at 08:40 UTC
Some notes below your chosen depth have not been shown here
Re^5: Encoding problem with function in C library by hippo (Archbishop) on Dec 23, 2022 at 08:16 UTC
Some notes below your chosen depth have not been shown here
Re^3: Encoding problem with function in C library by Anonymous Monk on Dec 22, 2022 at 13:18 UTC
Win32::Console::OutputCP( 65001 );
Re: Encoding problem with function in C library by Anonymous Monk on Dec 22, 2022 at 05:56 UTC
Hi Rob,

IMHO, this may not be a Perl issue. If you want a character to display correctly in a Windows cmd window, three things have to hold:

1. the program outputs the character correctly encoded for a code page (e.g. cp850)
2. the code page in cmd matches it (cp850)
3. the font that cmd uses can display the character

I can display any character this way. You should give it a try. ;)
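The first of these conditions can be illustrated with the core Encode module. In this sketch, `bytes_for_console` is a hypothetical helper (it is not part of any module mentioned in the thread) that returns the bytes a console on a given code page would need to receive for a given string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes a console using $codepage expects for $text.
sub bytes_for_console {
    my ($text, $codepage) = @_;
    return encode($codepage, $text);
}

# PLUS-MINUS SIGN is one byte in each code page, but not the same byte.
for my $cp (qw(cp437 cp850 cp1252)) {
    my $bytes = bytes_for_console("\x{B1}", $cp);
    printf "%-7s expects 0x%02X for PLUS-MINUS SIGN\n", $cp, ord($bytes);
}
```

Under cp437 and cp850 the expected byte is 0xF1, not the 0xB1 that the C code in the question ends up writing, so the first and second conditions are not both met there.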
Re^2: Encoding problem with function in C library by xiaoyafeng (Deacon) on Dec 22, 2022 at 06:00 UTC
Sorry, I forgot to log in. Besides, you can right-click on the cmd title bar and choose Properties; there you can see which font and code page are in use.

I am trying to improve my English skills, so if you see a mistake, please feel free to reply or /msg me a correction.
Re: Encoding problem with function in C library by Anonymous Monk on Dec 22, 2022 at 07:06 UTC
`chcp 1252`

The number 177 (0xB1) is hard-coded in your DLL, but in CP850 and CP437 that byte is the "medium shade" block character. The terminal uses those code pages so that (supposedly still used) DOS programs display their output as expected.
Re: Unicode on windows cmd not displaying properly by Anonymous Monk on Dec 22, 2022 at 13:00 UTC
Get smart ;)

Re: Windows console mangles UTF8 output
Re: Printing Unicode on the Windows Console and the importance of of i/o layers
Re: How to print utf8 char in Term::Screen::Win32 ?
Re: Perl, DOS and encodings