Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a script which reads a UTF-8 file, extracts data, and writes out some more UTF-8 files.

The strange thing is that the resulting files are broken: they no longer seem to be valid UTF-8.

I boiled the issue down to this small script, which just pipes the input through:

#!/usr/bin/perl
use strict;
use warnings;

binmode(STDIN, ':utf8');
open my $out, '>:utf8', 'testfile.txt';
while (<>) {
    print $out $_;
}
close $out;

Having an input file containing

trallalala

äöüÄÖÜß

And calling my script with:

./encodingtrouble.pl input-encodingtrouble

The resulting output testfile.txt looks as shown below (there seem to be some unprintable characters between the last Ãs):

trallalala

äöüÃÃÃÃ

When I do not open with '>:utf8', the output looks correct, but I'm puzzled about what's going on here. What am I doing wrong? Where is my misunderstanding?


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: encoding trouble
by Your Mother (Archbishop) on Apr 26, 2019 at 14:48 UTC

    Essentially, you are re-encoding UTF-8 as UTF-8. Here is another example of what's going on, with the O being the output layer:

    perl -E 'say "äöüÄÖÜß"'
    äöüÄÖÜß
    
    perl -CO -E 'say "äöüÄÖÜß"'
    äöüÃÃÃÃ
    

    Add this to your script: use open ":std", ":encoding(utf8)";

    The documentation for the open pragma explains what's going on.
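
To make the double encoding concrete, here is a minimal sketch at the byte level. It is not from the thread: it uses the core Encode module to imitate what an encoding output layer does to a string that was never decoded, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "ä" (U+00E4), as they sit in the input file.
my $bytes = "\xC3\xA4";

# Read without a decoding layer, each byte counts as one character,
# so Perl sees a two-character string.
my $undecoded = $bytes;

# Pushing those two characters through a UTF-8 encoder (which is what
# a ':utf8' output layer does) encodes each of them again.
my $double = encode('UTF-8', $undecoded);

# Decoding first yields a single character, which encodes back to
# the original two bytes.
my $char   = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $char);

# Helper (hypothetical name) to show a byte string as hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "undecoded length: %d\n", length $undecoded;    # 2
printf "decoded length:   %d\n", length $char;         # 1
printf "double-encoded:   %s\n", hex_bytes($double);   # C3 83 C2 A4
printf "round trip:       %s\n", hex_bytes($single);   # C3 A4
```

The four bytes C3 83 C2 A4 are what a viewer renders as "Ã" followed by "¤", which matches the kind of garbage seen in testfile.txt.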

Re: encoding trouble
by Eily (Monsignor) on Apr 26, 2019 at 14:43 UTC

    Because @ARGV is not empty when you read with <>, the handle that is read from is not STDIN but ARGV instead.

    use open IO => ':utf8'; may work better (ARGV is only opened by <>, at which point it's too late to call binmode because some data has already been read).
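
A sketch of the original script with that change applied might look like the following. Two things here are assumptions rather than part of the thread: the layer is written as ':encoding(UTF-8)', a stricter variant swapped in for ':utf8', and the script creates a small sample input with File::Temp when no file name is given, purely so it can be run on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Default layer for handles opened in this lexical scope; this also
# covers the implicit open of ARGV done by <>.
use open IO => ':encoding(UTF-8)';
use File::Temp qw(tempfile);

# Sample input (assumption, for a self-contained run): the raw UTF-8
# bytes of "äöü" followed by a newline.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} "\xC3\xA4\xC3\xB6\xC3\xBC\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Use the sample only when no files were named on the command line.
@ARGV = ($tmp_name) unless @ARGV;

# An explicit layer in the mode takes precedence over the pragma default.
open my $out, '>:encoding(UTF-8)', 'testfile.txt'
    or die "Cannot open testfile.txt: $!";

my $chars = 0;
while (my $line = <>) {
    $chars += length $line;    # counts characters, not bytes
    print {$out} $line;
}
close $out or die "Cannot close testfile.txt: $!";

print "characters read: $chars\n";
```

Run without arguments, this reads the sample file, so testfile.txt should end up with the same bytes as the input and the character count should be 4 (three letters plus the newline). Called as ./encodingtrouble.pl input-encodingtrouble, it reads the named file instead.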

    s/perlvar/ARGV/, thanks choroba

Re: encoding trouble
by karlgoethebier (Abbot) on Apr 26, 2019 at 16:29 UTC