Even more puzzled than before

kzwix has asked for the wisdom of the Perl Monks concerning the following question:

Hello again, fellow wisdom seekers.

I opened another thread this morning and considered editing it a second time to add this information, but decided against it: it is quite specific and could be overlooked (edits are kinda stealthy, after all).

So, it would seem that the "-C" command-line option doesn't always play by the rules. Using the following program:

use utf8;
use Encode;

$\ = "\n";
my $unicodeScalar = "Je suis une chaîne accentuée là où il faut.";
my $cmdLineArg = $ARGV[0];
my $stdInLine = <STDIN>;

print '['.Encode::is_utf8($unicodeScalar).'] '.$unicodeScalar;
print $unicodeScalar;
print '['.Encode::is_utf8($cmdLineArg).'] '.$cmdLineArg;
print '['.Encode::is_utf8($stdInLine).'] '.$stdInLine;

I made test runs from both a UTF-8 terminal and a LATIN-1 terminal, connected to the same UTF-8 machine, etc.
I tried running WITH those switches, and WITHOUT those switches.
The "accents_utf8" file contains a single line with a few accented letters (which you'll see displayed), and on standard input I type two accented letters, "éè", followed by a carriage return.

Run WITH the switches

UTF-8 console result:

$ perl -CSDA t.pl `cat accents_utf8`
éè
[1] Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
[] éèàùôî
[] éè

Latin-1 console result:

$ perl -CSDA t.pl `cat accents_utf8`
éè
[1] Je suis une chaÃ®ne accentuÃ©e lÃ  oÃ¹ il faut.
Je suis une chaÃ®ne accentuÃ©e lÃ  oÃ¹ il faut.
[1] Ã©Ã¨Ã Ã¹Ã´Ã®
[1] éè

Run WITHOUT the switches

UTF-8 console:

$ perl t.pl `cat accents_utf8`
éè
[1] Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
[] éèàùôî
[] éè

Latin-1 console:

$ perl t.pl `cat accents_utf8`
éè
[1] Je suis une chaîne accentuée là où il faut.
Je suis une chaîne accentuée là où il faut.
[] Ã©Ã¨Ã Ã¹Ã´Ã®
[] éè

I am officially puzzled, now. I mean:

- Why the hell would the UTF-8 console results ignore the presence of these switches? I mean, if I explicitly tell Perl that the inputs are UTF-8, and I use characters that ARE UTF-8, then why are they internally stored as Latin-1 (the utf8 bit not set)?
- Also, if the switch makes the output UTF-8, then why is my text garbled when printed, when it is internally perfectly sound UTF-8?
On the Latin-1 console, I see a difference... but only on the output.
- Without these switches, the result is legible when displaying the inner string (which is UTF-8), or what I entered manually on the console (in Latin-1), but the UTF-8 characters I provided from the "accents_utf8" file are garbled.
- With these switches, instead, it seems to work fine: it displays UTF-8 from the script, or from arguments, as real UTF-8, and the STDIN data I entered is "deemed to be UTF-8" (even though it isn't, really; it would be invalid UTF-8) and, as such, sets the utf8 bit in the scalars. Then it gets displayed as it should be.

The Perl version I use is 5.10.1. Could this behavior be a bug? Or have I overlooked something major?

EDIT: As the first comment pointed out, there was a flaw in my testing protocol. This command does what it advertises, no bug here (and, again, my thanks to McA and the other wise people who answered)
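
For readers who want to see the effect without relying on the command-line switch, here is a minimal sketch of roughly what the S and A parts of -CSDA ask Perl to do, written out inside a script. It is an approximation under stated assumptions, not an exact equivalent of the switch: the D part (default layers for other handles) is not reproduced, and `decode_args` is a hypothetical helper name that does not appear in the original program.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Roughly the "A" part: treat each command-line argument as UTF-8 bytes
# and turn it into a character string. (decode_args is a hypothetical name.)
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

# Roughly the "S" part: declare the standard handles to carry UTF-8.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

# Stand-in for an argument arriving from the shell: the UTF-8 bytes of "éè".
my @args = decode_args("\xc3\xa9\xc3\xa8");

printf("%d\n", length($args[0]));                   # 2 (characters, not bytes)
printf("%d\n", utf8::is_utf8($args[0]) ? 1 : 0);    # 1 (utf8 bit set)
```

In a real script the call would be `@ARGV = decode_args(@ARGV);` before the arguments are used.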


Replies are listed 'Best First'.
Re: Even more puzzled than before by McA (Priest) on Jun 20, 2014 at 16:32 UTC
Hi, just one hint from me: don't trust the terminal output! Always redirect your output to a file and inspect it with `hexdump -C` to check whether the encoding is what you expect. In your case I'm pretty sure several errors are adding up and leading you to assume something that is not true. Your very first example works on my machine as expected, not as shown by you. Regards, McA
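
The same check can also be sketched from inside Perl, by printing the bytes as hex instead of letting a terminal interpret them. This is only an illustration of the idea and not part of the reply above; `hex_bytes` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render each byte of a byte string as two hex digits.
# (hex_bytes is a hypothetical helper, similar in spirit to hexdump.)
sub hex_bytes {
    my ($bytes) = @_;
    return join(' ', map { sprintf('%02x', ord($_)) } split(//, $bytes));
}

my $text = "\x{e9}\x{e8}";    # the two characters "éè"

printf("%s\n", hex_bytes(encode('UTF-8', $text)));         # c3 a9 c3 a8
printf("%s\n", hex_bytes(encode('ISO-8859-1', $text)));    # e9 e8
```

If the dump shows `c3 a9` for "é", the data is UTF-8; a lone `e9` means Latin-1, whatever the terminal happens to draw.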
Re^2: Even more puzzled than before by kzwix (Sexton) on Jun 23, 2014 at 07:45 UTC
Thanks, I guess another layer was indeed messing with my results. I've retried the test using the protocol you suggested (redirecting the output to a file, THEN examining the file), and I succeeded.
Re: Even more puzzled than before by RonW (Parson) on Jun 20, 2014 at 17:16 UTC
perl -CSDA t.pl `cat accents_utf8`
Why are you passing the contents of accents_utf8 on the command line instead of having your script read the file directly? This adds another layer of processing, and so more uncertainty.
Re^2: Even more puzzled than before by kzwix (Sexton) on Jun 23, 2014 at 07:36 UTC
Well, the reason behind this was to test the behavior of command-line arguments. So, as my terminal would have typed "latin-1" accents, and not UTF-8 sequences, I used the output from the command to get a UTF-8 string.
Re^3: Even more puzzled than before by RonW (Parson) on Jun 24, 2014 at 05:02 UTC
This makes a lot of sense.
Re: Even more puzzled than before by ww (Archbishop) on Jun 20, 2014 at 18:37 UTC
OP advised by msg: cite the prior node... so here it is: Default encoding rules leave me puzzled... check Ln42!
Re: Even more puzzled than before by aitap (Curate) on Jun 21, 2014 at 05:47 UTC
By the way, if you omit both `use utf8` and the `-C...` switch, Perl will treat your strings as binary data. This will both avert the warnings and make sure that strings come out exactly as you typed them in the source code. Of course, text operations (like substr and pattern matching) will be broken in this case, but if all you want is to print UTF-8 unchanged, this is much easier than making Perl encode and decode them needlessly. Your first example works for me, and if I remove `use utf8` from it and run it without `-C...`, it still does. For portable text processing scripts, check out Encode::Locale.
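
A minimal sketch of the trade-off described above, assuming the string holds the UTF-8 bytes of "éè": left undecoded, the bytes pass through unchanged, but length and pattern matching count bytes rather than characters. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xc3\xa9\xc3\xa8";         # UTF-8 bytes of "éè", treated as binary data
my $chars = decode('UTF-8', $bytes);    # the same text decoded into 2 characters

printf("%d\n", length($bytes));         # 4: bytes are counted
printf("%d\n", length($chars));         # 2: characters are counted

# "Exactly two characters" only holds for the decoded string.
printf("%s\n", ($bytes =~ /\A..\z/) ? 'match' : 'no match');    # no match
printf("%s\n", ($chars =~ /\A..\z/) ? 'match' : 'no match');    # match

# Encoding the decoded text again gives back the original bytes.
my $round_trip = encode('UTF-8', $chars);
printf("%s\n", ($round_trip eq $bytes) ? 'round trip ok' : 'round trip differs');    # round trip ok
```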
Re^2: Even more puzzled than before by Anonymous Monk on Jun 21, 2014 at 08:50 UTC
binary data with newline translation ... yeah :D
Re: Even more puzzled than before by Anonymous Monk on Jun 20, 2014 at 17:07 UTC
Hm, works for me.