hexcoder has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

When decoding some binary log files on Windows, I wanted to output the decoded text correctly both to the console and, when STDOUT is redirected, to a UTF-8 encoded file.

So, I thought how hard can that be?

From the documentation I understood binmode() to be compatible with open() regarding I/O layers, but my results are strangely different.

I wrote this small test program to output some non-ASCII characters (german umlauts). It checks if STDOUT is redirected, and should adjust the encoding accordingly.
use strict;
use warnings;
use Win32::Console::Ansi;   # converts output for code page 850 to OEM code page

my $isRedirected = ! -t STDOUT;
my $Str = 'äöüÄÖÜß' . "\n"; # this string has default encoding iso-8859-1

# if STDOUT is redirected to a file, use unicode encoding, otherwise use default
if ($isRedirected) {
    if (!defined binmode STDOUT, ':encoding(UTF-8)') {
        warn "binmode failed: $!\n";
    }
    utf8::encode($Str);
}

# output string with a hexdump
if (!print 'string: ', hd($Str), $Str) {
    warn "print failed\n";
}
# With a redirected file I end up with a UTF-16 LE BOM encoded file with wrong content :-(

# for comparison, this works as expected (produces utf8 content)
open(my $fh, '>:encoding(UTF-8)', 'utf8_2')
    or die "can't open file for writing: $!\n";
print $fh $Str or warn "print to file failed\n";
close $fh      or warn "close file failed\n";

# hexdump
sub hd {
    my $input = shift;
    return join(' ', unpack('(H2)*', $input)), "\n";
}
The written file 'utf8_2' contains the expected output

äöüÄÖÜß

which is

000000 c3 a4 c3 b6 c3 bc c3 84 c3 96 c3 9c c3 9f 0d 0a

while the redirected STDOUT output looks very different.

string: e4 f6 fc c4 d6 dc df 0a ├ñ├Â├╝├ä├û├£├ƒ
The second line has this hexdump:

000000 1c 25 f1 00 1c 25 c2 00 1c 25 5d 25 1c 25 e4 00
000010 1c 25 fb 00 1c 25 a3 00 1c 25 92 01 0d 00 0a 00
What happened here? How did I end up with a UTF-16 LE BOM version? What would you suggest to obtain UTF8 encoding for the redirected file?

Thanks for any enlightenment!

Re: binmode(':encoding(UTF-8)') did not produce utf8 for me
by ikegami (Patriarch) on Jul 03, 2023 at 23:26 UTC

    First, you claim your source code contains

    my $Str = 'äöüÄÖÜß' . "\n";

    Perl expects source encoded using UTF-8 when use utf8; is in effect.
    Perl expects source encoded using ASCII when use utf8; isn't.

    Seeing as you didn't use use utf8;, and seeing as those characters aren't ASCII characters, your script couldn't possibly contain those characters.

    Based on the hd output, it appears that your source code is encoded using ISO-8859-1 as you say, or more likely Windows-1252 since you're on a Windows machine.


    Secondly, the Win32 console is extremely unlikely to want iso-8859-1 or Windows-1252. "cp" . Win32::GetConsoleOutputCP() will produce the correct encoding.


    If your script is encoded using Windows-1252 or ISO-8859-1 (as it currently is), you want this:

    use v5.14;
    use warnings;
    use Encode qw( decode );
    use Win32 qw( );

    my $enc = -t STDOUT
        ? "cp" . Win32::GetConsoleOutputCP()
        : "UTF-8";
    binmode( STDOUT, ":encoding($enc)" );

    my $str = decode( "cp1252", "äöüÄÖÜß" );
    #warn sprintf "%vX", $str;  # E4.F6.FC.C4.D6.DC.DF
    say $str;

    If your script is encoded using UTF-8 (the modern standard), you want this:

    use v5.14;
    use warnings;
    use utf8;
    use Win32 qw( );

    my $enc = -t STDOUT
        ? "cp" . Win32::GetConsoleOutputCP()
        : "UTF-8";
    binmode( STDOUT, ":encoding($enc)" );

    my $str = "äöüÄÖÜß";
    #warn sprintf "%vX", $str;  # E4.F6.FC.C4.D6.DC.DF
    say $str;

    Either way, you get this:

    >chcp
    Active code page: 65001    # My machine uses UTF-8 for the console.
    >perl a.pl
    äöüÄÖÜß

    >chcp 850
    Active code page: 850      # It used to default to this.
    >perl a.pl
    äöüÄÖÜß

    >chcp 437
    Active code page: 437      # Common in the US.
    >perl a.pl
    äöüÄÖÜß

    >perl a.pl >a
    >perl -Mv5.14 -ne"say sprintf '%vX', $_" a
    C3.A4.C3.B6.C3.BC.C3.84.C3.96.C3.9C.C3.9F.A   # UTF-8 of äöüÄÖÜß

    See Re: [OT] ASCII, cmd.exe, linux console, charset, code pages, fonts and other amenities.

      Hello ikegami,

      thanks very much for a fast response!

      I ran the first of your scripts (without 'use utf8;') in a PowerShell console with code page 850 in Strawberry Perl 5.32.1 32-bit. Without redirection I get the same output as you.

      With redirection however I get this file content:

      000000 ff fe 1c 25 f1 00 1c 25 c2 00 1c 25 5d 25 1c 25
      000010 e4 00 1c 25 fb 00 1c 25 a3 00 1c 25 92 01 0d 00
      000020 0a 00

      This is Windows 10 version 21H2 (Build 19044.3086).

      That made me curious, and I repeated the run in a plain CMD.EXE shell. And surprise! I then get the same output as you did.

      So something in PowerShell disturbs the output during redirection, it seems. That nasty behavior took me by surprise. It looks like that is where I get the unwanted extra UTF-16 LE BOM conversion from...

      Another thing I noticed with PowerShell was that perl "-Mv5.14" -ne"say sprintf '%vX', $_" a only produced two empty lines.

      Thanks very much again for helping me out!

      > Perl expects source encoded using UTF-8 when use utf8; is in effect.
      > Perl expects source encoded using ASCII when use utf8; isn't.

      I don't think that's a very helpful way of looking at it. I'd say "Perl upgrades any literal strings in (utf-8) source code to character semantics when use utf8; is in effect" would be more helpful.

      In the end, it's about whether the strings you are working with are using byte semantics or character semantics. Because binmode ":encoding()" only works with strings with character semantics and does nothing with byte semantics.

      It takes some work to get used to this byte/character semantics distinction.

        That wouldn't be more useful, since that's not what it does.

        Perl decodes the source from UTF-8 with use utf8;, and from ASCII (with 8-bit-clean literals) without it. And it does that for the entire source code, not just literals. And the literals don't necessarily use the upgraded format, even with use utf8;.

        Your explanation is simply completely wrong.


        In the end, it's about whether the strings you are working with are using byte semantics or character semantics.

        No. It very much isn't. It affects the encoding used to decode the entire code, not the internal storage format of literals.

        $ perl -Mv5.14 -e'use utf8; sub fée { }'

        $ perl -Mv5.14 -e'no utf8; sub fée { }'
        Illegal declaration of subroutine main::f at -e line 1.

        Because binmode ":encoding()" only works with strings with character semantics and does nothing with byte semantics.

        That's not true either. It works for both.

        $ perl -Mv5.14 -e'
            binmode STDOUT, ":encoding(UTF-8)";
            $_ = "\xE9";
            utf8::upgrade($_);
            say;
        ' | od -t x1
        0000000 c3 a9 0a
        0000003

        $ perl -Mv5.14 -e'
            binmode STDOUT, ":encoding(UTF-8)";
            $_ = "\xE9";
            utf8::downgrade($_);
            say;
        ' | od -t x1
        0000000 c3 a9 0a
        0000003

        "Byte semantics" and "Unicode semantics" are (confusing and misleading) terms used to describe code suffering from The Unicode Bug. :encoding does not suffer from The Unicode Bug.

        :encoding is not even being discussed!

Re: binmode(':encoding(UTF-8)') did not produce utf8 for me
by NERDVANA (Priest) on Jul 04, 2023 at 01:02 UTC
    Ikegami has a nice explanation, but I want to add that
    if ($isRedirected) {
        if (!defined binmode STDOUT, ':encoding(UTF-8)') {
            warn "binmode failed: $!\n";
        }
        utf8::encode($Str);   # <--- BUG HERE
    }

    The function utf8::encode changes the bytes/characters that the string contains. So if you start out with 8 bytes in the string which you happen to know are an ISO-8859-1 representation of your characters, then calling utf8::encode is going to result in 15 bytes, which will be your characters represented in UTF-8, and then when you print on STDOUT those bytes get encoded as if they were each a Unicode character, resulting in nearly 30 bytes.

    In other words, if you put the utf8 layer on a file handle, don't also encode things prior to printing them.

    If that doesn't make sense, you need to first understand that Perl doesn't *track* character encodings of its strings, or even whether they are meant to be bytes or characters. It requires *you* to keep track of that and ask for the appropriate conversions at the appropriate points. The best advice for dealing with that requirement is to try to make sure you always have unicode in your strings and only encode or decode at the edges of your program as data comes in or goes out (such as with filehandle layers) ...and make sure your string constants are in a source file with "use utf8" written by a text editor that writes utf8.

    Edit: to clarify, if you happen to have some characters in a string which came from a Windows codepage that doesn't match unicode's definition for 0x80-0xFF, then you first need to decode that, (to get unicode) before re-encoding as utf8. The ":encoding(UTF-8)" layer makes the assumption that you start from unicode codepoints, and will generate garbage if you started from something different.