kzwix has asked for the wisdom of the Perl Monks concerning the following question:

Hello again, fellow wisdom seekers.

I opened another thread this morning and considered editing it a second time to add this information, but decided against it: the issue is quite specific and could be overlooked, since edits are kind of stealthy, after all.

So, it would seem that the "-C" command-line option doesn't always play by the rules. Using the following program:

use utf8;    # the source file itself is UTF-8
use Encode;

$\ = "\n";   # append a newline to every print

my $unicodeScalar = "Je suis une chaîne accentuée là où il faut.";
my $cmdLineArg = $ARGV[0];
my $stdInLine = <STDIN>;

# The bracketed value is the UTF-8 flag of each string (1 if set, empty otherwise)
print '['.Encode::is_utf8($unicodeScalar).'] '.$unicodeScalar;
print $unicodeScalar;
print '['.Encode::is_utf8($cmdLineArg).'] '.$cmdLineArg;
print '['.Encode::is_utf8($stdInLine).'] '.$stdInLine;

I ran the test from both a UTF-8 terminal and a Latin-1 terminal, each connected to the same UTF-8 machine.
I tried running WITH the -C switches, and WITHOUT them.
The "accents_utf8" file contains a single line with a few accented letters (which you'll see displayed), and I type two accented letters, "éè", on the standard input, followed by a carriage return.

Run WITH the switches

UTF-8 console result:
$ perl -CSDA t.pl `cat accents_utf8`
éè
[1] Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
[] éèàùôî
[] éè
Latin-1 console result:
$ perl -CSDA t.pl `cat accents_utf8`
éè
[1] Je suis une chaîne accentuée là où il faut.
Je suis une chaîne accentuée là où il faut.
[1] éèà ùôî
[1] éè

Run WITHOUT the switches

UTF-8 console:
$ perl t.pl `cat accents_utf8`
éè
[1] Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
Je suis une cha▒ne accentu▒e l▒ o▒ il faut.
[] éèàùôî
[] éè
Latin-1 console:
$ perl t.pl `cat accents_utf8`
éè
[1] Je suis une chaîne accentuée là où il faut.
Je suis une chaîne accentuée là où il faut.
[] éèà ùôî
[] éè

I am officially puzzled, now. I mean:

The Perl version I use is 5.10.1. Could this behavior be a bug? Or have I overlooked something major?


EDIT: As the first comment pointed out, there was a flaw in my testing protocol. The "-C" option does what it advertises; there is no bug here (and, again, my thanks to McA and the other wise people who answered).

Re: Even more puzzled than before
by McA (Priest) on Jun 20, 2014 at 16:32 UTC

    Hi

    Just one hint from me: Don't trust the terminal output!

    Always redirect your output to a file and inspect it with hexdump -C to check whether the encoding is what you expect.

    In your case I'm pretty sure several errors are adding up, leading you to assume something that is not true. Your very first example works on my machine as expected, not as you show it.

    Regards
    McA

      Thanks, I guess another layer was indeed messing with my results. I retried the test using the protocol you suggested (redirect the output to a file, THEN examine the file), and it worked.
Re: Even more puzzled than before
by RonW (Parson) on Jun 20, 2014 at 17:16 UTC
    perl -CSDA t.pl `cat accents_utf8`

    Why are you passing the contents of accents_utf8 on the command line instead of having your script read the file directly? This adds another layer of processing, and thus more uncertainty.

      Well, the reason behind this was to test the behavior of command-line arguments. Since my terminal would have sent Latin-1 accents rather than UTF-8 sequences, I used the output of the cat command to get a UTF-8 string.
        This makes a lot of sense.
Re: Even more puzzled than before
by ww (Archbishop) on Jun 20, 2014 at 18:37 UTC
Re: Even more puzzled than before
by aitap (Curate) on Jun 21, 2014 at 05:47 UTC

    By the way, if you omit both use utf8 and the -C... switch, Perl will treat your strings as binary data. This will both avert the warnings and make sure that strings come out exactly as you typed them in the source code. Of course, text operations (like substr and pattern matching) will be broken in this case, but if all you want is to print UTF-8 unchanged, this is much easier than making Perl encode and decode them needlessly.

    Your first example works for me, and if I remove use utf8 from it and run it without -C..., it still does.

    For portable text processing scripts check out Encode::Locale.

      binary data with newline translation ... yeah :D
Re: Even more puzzled than before
by Anonymous Monk on Jun 20, 2014 at 17:07 UTC
    Hm, works for me.