in reply to binmode(':encoding(UTF-8)') did not produce utf8 for me

First, you claim your source code contains

my $Str = 'äöüÄÖÜß' . "\n";

Perl expects source encoded using UTF-8 when use utf8; is in effect.
Perl expects source encoded using ASCII when use utf8; isn't.

Seeing as you didn't use use utf8;, and seeing as those characters aren't ASCII characters, your script couldn't possibly contain those characters.
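To make the difference concrete, here is a minimal sketch. It assumes the file is saved as UTF-8 (unlike your current script), and the variable names are only for illustration. The same literal yields 7 characters under use utf8; and 14 bytes without it.

```perl
use strict;
use warnings;

my ( $chars, $bytes );
{
    use utf8;    # Source is decoded from UTF-8: the literal holds 7 characters.
    $chars = "äöüÄÖÜß";
}
{
    no utf8;     # Source bytes are taken as-is: the literal holds 14 bytes.
    $bytes = "äöüÄÖÜß";
}

# %vX prints the ordinal of each element of the string in hex.
printf "use utf8: length %d, %vX\n", length($chars), $chars;
printf "no utf8:  length %d, %vX\n", length($bytes), $bytes;
```

With use utf8; the dump is E4.F6.FC.C4.D6.DC.DF (code points); without it, you see the UTF-8 bytes C3.A4.C3.B6 and so on.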

Based on the hd output, it appears that your source code is encoded using ISO-8859-1 as you say, or more likely Windows-1252 since you're on a Windows machine.


Secondly, the Win32 console is extremely unlikely to want iso-8859-1 or Windows-1252. "cp" . Win32::GetConsoleOutputCP() will produce the correct encoding.


If your script is encoded using Windows-1252 or ISO-8859-1 (as it currently is), you want this:

use v5.14;
use warnings;
use Encode qw( decode );
use Win32 qw( );

my $enc = -t STDOUT ? "cp" . Win32::GetConsoleOutputCP() : "UTF-8";
binmode( STDOUT, ":encoding($enc)" );

my $str = decode( "cp1252", "äöüÄÖÜß" );
#warn sprintf "%vX", $str;  # E4.F6.FC.C4.D6.DC.DF
say $str;

If your script is encoded using UTF-8 (the modern standard), you want this:

use v5.14;
use warnings;
use utf8;
use Win32 qw( );

my $enc = -t STDOUT ? "cp" . Win32::GetConsoleOutputCP() : "UTF-8";
binmode( STDOUT, ":encoding($enc)" );

my $str = "äöüÄÖÜß";
#warn sprintf "%vX", $str;  # E4.F6.FC.C4.D6.DC.DF
say $str;

Either way, you get this:

>chcp
Active code page: 65001     # My machine uses UTF-8 for the console.

>perl a.pl
äöüÄÖÜß

>chcp 850
Active code page: 850       # It used to default to this.

>perl a.pl
äöüÄÖÜß

>chcp 437
Active code page: 437       # Common in the US.

>perl a.pl
äöüÄÖÜß

>perl a.pl >a

>perl -Mv5.14 -ne"say sprintf '%vX', $_" a
C3.A4.C3.B6.C3.BC.C3.84.C3.96.C3.9C.C3.9F.A     # UTF-8 of äöüÄÖÜß

See Re: [OT] ASCII, cmd.exe, linux console, charset, code pages, fonts and other amenities.

Replies are listed 'Best First'.
Re^2: binmode(':encoding(UTF-8)') did not produce utf8 for me
by hexcoder (Curate) on Jul 04, 2023 at 17:37 UTC
    Hello ikegami,

    thanks very much for a fast response!

    I ran the first of your scripts (without 'use utf8;') in a PowerShell console with code page 850 in Strawberry Perl 5.32.1 32-bit. Without redirection I get the same output as you.

    With redirection however I get this file content:

    000000 ff fe 1c 25 f1 00 1c 25 c2 00 1c 25 5d 25 1c 25
    000010 e4 00 1c 25 fb 00 1c 25 a3 00 1c 25 92 01 0d 00
    000020 0a 00

    This is Windows 10 version 21H2 (Build 19044.3086).

    That made me curious, and I repeated the run in a plain CMD.EXE shell. And surprise! I then get the same output as you did.

    So something in PowerShell disturbs the output during redirection, it seems. That nasty behavior took me by surprise. It looks like that is where I get the unwanted extra UTF-16 LE BOM conversion from...
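One hypothesis that fits the dump: PowerShell read Perl's UTF-8 output as code page 850 and then wrote it out as UTF-16LE with a BOM. The sketch below replays that assumed double conversion in Perl. The LF to CRLF step is likewise an assumption, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# UTF-8 bytes of "äöüÄÖÜß\n", as the script emits them when redirected.
my $utf8_bytes = "\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x84\xC3\x96\xC3\x9C\xC3\x9F\x0A";

# Assumed step 1: the bytes are misread as code page 850.
my $misread = decode( "cp850", $utf8_bytes );

# Assumed step 2: the line ending becomes CRLF.
$misread =~ s/\x0A/\x0D\x0A/;

# Assumed step 3: the text is written as UTF-16LE, preceded by a BOM.
my $file = "\xFF\xFE" . encode( "UTF-16LE", $misread );

my $hex = join " ", map { sprintf "%02x", ord } split //, $file;
print "$hex\n";
```

The printed bytes match the redirected file byte for byte, which supports the idea that the conversion happens in the shell and not in Perl.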

    Another thing I noticed with PowerShell was that perl "-Mv5.14" -ne"say sprintf '%vX', $_" a produced only two empty lines.

    Thanks very much again for helping me out!
Re^2: binmode(':encoding(UTF-8)') did not produce utf8 for me
by Anonymous Monk on Jul 06, 2023 at 19:43 UTC

    > Perl expects source encoded using UTF-8 when use utf8; is in effect.
    > Perl expects source encoded using ASCII when use utf8; isn't.

    I don't think that's a very helpful way of looking at it. I'd say "Perl upgrades any literal strings in (utf-8) source code to character semantics when use utf8; is in effect" would be more helpful.

    In the end, it's about whether the strings you are working with are using byte semantics or character semantics. Because binmode ":encoding()" only works with strings with character semantics and does nothing with byte semantics.

    It takes some work to get used to this byte/character semantics distinction.

      That wouldn't be more useful, since that's not what it does.

      Perl decodes the source from UTF-8 with use utf8;, and from ASCII (with 8-bit clean literals) without it. It does that for the entire source code, not just literals. And the literals don't necessarily use the upgraded format, even with use utf8;.

      Your explanation is simply completely wrong.


      > In the end, it's about whether the strings you are working with are using byte semantics or character semantics.

      No. It very much isn't. It affects the encoding used to decode the entire code, not the internal storage format of literals.

      $ perl -Mv5.14 -e'use utf8; sub fée { }'

      $ perl -Mv5.14 -e'no utf8; sub fée { }'
      Illegal declaration of subroutine main::f at -e line 1.
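The point about the storage format of literals can be checked directly. This is a minimal sketch, assuming the file is saved as UTF-8; the variable names are illustrative. utf8::is_utf8 reports only how a string is stored internally, so it may differ between strings that compare equal.

```perl
use strict;
use warnings;
use utf8;    # this sketch assumes the source file is saved as UTF-8

my $ascii  = "abc";     # ASCII-only literal
my $escape = "\xE9";    # é written as an escape
my $direct = "é";       # é written directly in the source

# utf8::is_utf8 reports the internal storage format, not the string's value.
printf "%-6s upgraded: %s\n", $_->[0], utf8::is_utf8( $_->[1] ) ? "yes" : "no"
    for [ ascii => $ascii ], [ escape => $escape ], [ direct => $direct ];

# Both spellings of é are the same one-character string.
print "same string: ", ( $escape eq $direct ? "yes" : "no" ), "\n";
```

The ASCII-only literal is not stored upgraded even though use utf8; is in effect, and the two spellings of é are equal whatever their storage format.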

      > Because binmode ":encoding()" only works with strings with character semantics and does nothing with byte semantics.

      That's not true either. It works for both.

      $ perl -Mv5.14 -e'
         binmode STDOUT, ":encoding(UTF-8)";
         $_ = "\xE9";
         utf8::upgrade($_);
         say;
      ' | od -t x1
      0000000 c3 a9 0a
      0000003

      $ perl -Mv5.14 -e'
         binmode STDOUT, ":encoding(UTF-8)";
         $_ = "\xE9";
         utf8::downgrade($_);
         say;
      ' | od -t x1
      0000000 c3 a9 0a
      0000003

      "Byte semantics" and "Unicode semantics" are (confusing and misleading) terms used to describe code suffering from The Unicode Bug. :encoding does not suffer from The Unicode Bug.
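For contrast, here is a minimal sketch of code that does suffer from The Unicode Bug. It deliberately avoids use v5.12 or later and use feature 'unicode_strings', since those enable the fix; the variable names are illustrative. Two equal strings then behave differently in a regex match depending only on their internal storage format.

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" (or later) and no "use feature 'unicode_strings'":
# either would fix the behaviour shown here.

my $down = "\xE9";  utf8::downgrade($down);    # é stored as a single byte
my $up   = "\xE9";  utf8::upgrade($up);        # é stored upgraded

print "same string:            ", ( $down eq $up ? "yes" : "no" ), "\n";
print "downgraded matches \\w:  ", ( $down =~ /\w/ ? "yes" : "no" ), "\n";
print "upgraded matches \\w:    ", ( $up   =~ /\w/ ? "yes" : "no" ), "\n";
```

The strings compare equal, yet only the upgraded one matches \w. Compare that with the :encoding runs above, where both storage formats produce the same output.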

      :encoding is not even being discussed!