Re^2: Encoding problem with function in C library

That looks wrong to me because a char is one byte and you've given it two (C2 B1),...

Yeah - when I do a noisy build (and thereby have any warnings displayed) I see the same warning.
I also see warning: unsigned conversion from 'int' to 'unsigned char' changes value from '49841' to '177' [-Woverflow].
Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

In pure Perl, with a combination of chcp 65001 and use open qw/:std :encoding(UTF-8)/;, I can get it to output correctly

Thanks for that - I see exactly the same behaviours with your test.pl as you did.

The mention of codepage 65001 looks to have been the godsend I was looking for.
Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?
I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run.
I guess I should also check to see if 65001 ought only be set when Windows prints to stdout.
Does that sound reasonable ? (I will check this on a range of perls on Windows 7 and Windows 11 anyway.)
Update: Hmmm, a quick check on Windows 7 reveals that changing the codepage is apparently having no effect. Bummer !! (But it's getting late over here, and will have to wait until I've had some sleep, before checking further.)
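
The save/set/restore idea above can be sketched roughly as follows. This is only a minimal sketch, not anything Math::MPC actually does: `with_utf8_console` is a hypothetical helper name, and the `-t STDOUT` test is an assumed way of checking that output really goes to a console. It uses `Win32::GetConsoleOutputCP` and `Win32::SetConsoleOutputCP` from the Win32 module, and does nothing special on other platforms.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Math::MPC): run $code with the console
# output code page set to 65001 (UTF-8), then put the previous one back.
sub with_utf8_console {
    my ($code) = @_;

    # Only touch the code page on Windows, and only if STDOUT is a terminal.
    my $switch = $^O eq 'MSWin32' && -t STDOUT;
    my $old_cp;
    if ($switch) {
        require Win32;
        $old_cp = Win32::GetConsoleOutputCP();
        Win32::SetConsoleOutputCP(65001);
    }

    # The eval lets the restore below run even if $code dies.
    my @result = eval { $code->() };
    my $err    = $@;

    Win32::SetConsoleOutputCP($old_cp) if $switch && $old_cp;
    die $err if $err;
    return wantarray ? @result : $result[0];
}

# Usage: wrap whatever call produces the UTF-8 output.
with_utf8_console(sub { print "\xC2\xB1\n" });   # raw UTF-8 bytes for U+00B1
```

One caveat with restoring immediately: anything still sitting in an output buffer when the code page is switched back may be written under the old code page, so the output probably needs to be flushed before the restore.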

Thanks ever so much for the responses. Things are now a little clearer.

Cheers,
Rob

Re^3: Encoding problem with function in C library by Corion (Patriarch) on Dec 22, 2022 at 13:20 UTC
Win32::Console gives you the interface to `->OutputCP`:

```perl
#!perl
use strict;
use warnings;
use charnames ':full';
use Win32::Console;

my $c = Win32::Console->new(STD_OUTPUT_HANDLE);
$c->OutputCP(65001); # we write UTF-8

binmode STDOUT, ':encoding(UTF-8)';
print "\N{INFINITY}\n";
```

In my tests, the output code page persisted after the program run, so you might (or might not) want to save/restore the codepage:

```perl
my $oldCP = $c->OutputCP();
$c->OutputCP(65001); # we write UTF-8
END {
    if( $c ) {
        $c->OutputCP($oldCP); # restore the original code page
    }
}
...
```
Re^4: Encoding problem with function in C library (updated) by haukex (Archbishop) on Dec 22, 2022 at 13:35 UTC
In my tests, the output code page persisted after the program run

~~Interesting, in my test on Win 10 Pro with `Win32::SetConsoleOutputCP(65001)`, the codepage change didn't persist.~~

Update: Sorry, never mind, see my reply below!
Re^5: Encoding problem with function in C library by Corion (Patriarch) on Dec 22, 2022 at 13:43 UTC
That's really interesting, since it certainly persists for me. I expected the whole console window to change output based on the strings, but I get the following output:

tmp.pl

```perl
#!perl
use strict;
use warnings;
use charnames ':full';
use Win32::Console;

binmode STDOUT, ':encoding(UTF-8)';
print "\N{INFINITY}\n";
```

tmp2.pl

```perl
#!perl
use strict;
use warnings;
use charnames ':full';
use Win32;

Win32::SetConsoleOutputCP(65001);
binmode STDOUT, ':encoding(UTF-8)';
print "\N{INFINITY}\n";
```

And the console output:

```
C:>perl q:\tmp.pl
дъз

C:>perl q:\tmp2.pl
∞

C:>perl q:\tmp.pl
∞

C:>chcp
Aktive Codepage: 850.
```

I'd expect the code page not to persist, and the output of CHCP does indicate that, but the terminal output / interpretation of the programs does indicate that after the first change to UTF-8, the output of subsequent programs is also interpreted as UTF-8 ...
Re^6: Encoding problem with function in C library by haukex (Archbishop) on Dec 22, 2022 at 13:54 UTC
Re^3: Encoding problem with function in C library by haukex (Archbishop) on Dec 22, 2022 at 13:32 UTC
Then, when the chr(177) gets written to a text file, it displays as the desired plus-or-minus symbol when viewed in Windows notepad.

In this case it just so happens that the UTF-8 encoding of `U+00B1 PLUS-MINUS SIGN` is `C2 B1` and I guess that the C compiler is doing the equivalent of `unsigned char a = 0xC2B1 & 0xFF`. It also just so happens that `0xB1` (177) is the character ± in CP1252, Latin-1, and others (which I guess is Notepad's interpretation), but in CP850 and CP437, `0xB1` is ▒. You'll probably not see this happening with ∞ `U+221E INFINITY`, whose UTF-8 encoding is `E2 88 9E`, but which is `0xEC` in CP437, and which has no representation in the other three encodings I mentioned. Oh, the joys of single-byte encodings `:-)`

Using that codepage, this troublesome C library function (in mpc-1.3.x) by the name of "mpcr_out_str", then displays correctly when accessed from the Math::MPC module.

It would seem logical to me then that the library is outputting UTF-8.

Is there some way I can manipulate the active code page in perl (on windows) without shelling out to chcp ?

I should note I'm not an expert on this topic - but this works for me:

```perl
use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;
use Win32;

Win32::SetConsoleOutputCP(65001);
print "\N{U+B1}\N{U+221E}\n";
```

I'm thinking that, for Windows only, Math::MPC needs to change the codepage to 65001 before calling this function ... and then it ought also revert the codepage to its original setting immediately after the function has been run.

CP65001 is UTF-8, and IMHO UTF-8 is probably the most universal, so unless you've got some other funky Unicode stuff going on, I don't think you'd need to change it back. The boilerplate I showed above should be fine for the entire process - that and, according to the sources I found, making sure that your terminal is using a Unicode-capable font.
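
The byte arithmetic described in this reply can be checked from Perl on any platform. The following is just an illustrative sketch using the core Encode module; the variable names are made up for the example, and it prints code points in hex rather than the characters themselves so that the result does not depend on the terminal's own code page.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 encoding of U+00B1 PLUS-MINUS SIGN is the two bytes C2 B1.
my $utf8 = encode('UTF-8', "\x{B1}");
my $hex  = uc(unpack('H*', $utf8));

# Squeezing that two-byte value into one unsigned char keeps the low byte.
my $low = 0xC2B1 & 0xFF;

# Decode that single byte under each code page and record the code point.
my %as;
for my $cp (qw(cp1252 iso-8859-1 cp850 cp437)) {
    $as{$cp} = sprintf('U+%04X', ord(decode($cp, chr($low))));
}

print "UTF-8 bytes: $hex\n";    # C2B1
print "low byte:    $low\n";    # 177
print "$_ => $as{$_}\n" for sort keys %as;
# cp1252 and iso-8859-1 give U+00B1 (±); cp437 and cp850 give U+2592 (▒)
```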
Re^4: Encoding problem with function in C library by syphilis (Archbishop) on Dec 23, 2022 at 00:09 UTC
```perl
use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;
use Win32;

Win32::SetConsoleOutputCP(65001);
print "\N{U+B1}\N{U+221E}\n";
```

That works nicely on Windows 10 and 11. But not on Windows 7, where I find that altering the codepage ostensibly succeeds, but in reality has no effect. Perhaps the explanation for that might be found in one of AM's links.

Anyway, I can probably ignore this issue with Windows 7 and earlier. It's unlikely that anyone other than me would ever hit it.

... so unless you've got some other funky Unicode stuff going on, I don't think you'd need to change it back

Yes, I think so. It seems that `Win32::SetConsoleOutputCP(65001)` sets the codepage for the duration of the program and that should generally be fine, whereas `chcp 65001` sets it for the duration of the cmd.exe console (and that's not so acceptable).

Thanks again for the pointers, guys !!

Cheers,
Rob
Re^5: Encoding problem with function in C library by haukex (Archbishop) on Dec 23, 2022 at 08:40 UTC
It seems that Win32::SetConsoleOutputCP(65001) sets the codepage for the duration of the program

As Corion pointed out, that's unfortunately not the case: the change does persist, and you'll have to do something in an `END` block like he showed.
Re^6: Encoding problem with function in C library by syphilis (Archbishop) on Dec 23, 2022 at 10:12 UTC
Re^5: Encoding problem with function in C library by hippo (Archbishop) on Dec 23, 2022 at 08:16 UTC
Anyway, I can probably ignore this issue with Windows 7 and earlier

7 is already long past standard EoL and goes full EoL on the 10th of January (i.e., less than 3 weeks from now), so yes, it should definitely just be ignored.

🦛
Re^6: Encoding problem with function in C library by syphilis (Archbishop) on Dec 23, 2022 at 10:52 UTC
Re^3: Encoding problem with function in C library by Anonymous Monk on Dec 22, 2022 at 13:18 UTC
Win32::Console::OutputCP( 65001 );