in reply to Re^3: german Alphabet
in thread german Alphabet

Perl assumes source code is ASCII, not latin-1.

$ perl -Mutf8 -MEncode -e'print encode("latin-1", "sub fête {}\n");' \
   | perl
Illegal declaration of subroutine main::f at - line 1.

If you happen to use an 8-bit byte in a string literal, Perl creates a character with the value of that byte rather than throwing an error.
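
A minimal sketch of the string-literal case, for illustration only. It assumes the script file is saved as UTF-8 and that `use utf8` is not in effect, so the literal reaches Perl as the two bytes C3 AA; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8 and `use utf8` is NOT in effect,
# so the literal below reaches Perl as the two bytes C3 AA.
my $s = "ê";

# Each 8-bit byte in the literal becomes one character with that byte's value.
my @values = map { sprintf '%02X', ord } split //, $s;
printf "length=%d values=%s\n", length($s), "@values";
```

Under those assumptions the string should hold two characters, one per source byte, and no error is raised.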

Re^5: german Alphabet
by Anonymous Monk on Dec 15, 2018 at 19:51 UTC
    It might be important to note that when one tries to print a wide string that happens to be representable in latin-1, Perl uses latin-1 with no warnings:
    $ perl -w -Mutf8 -E'print "ê"' | hd
    00000000  ea                                                |.|
    00000001

    "ê" is decoded into characters but then printed to a handle that doesn't have an :encoding(...) or :utf8 I/O layer. Since it's representable in latin-1, the single-byte encoding is used and no warning is shown.
    $ perl -w -Mutf8 -E'print "ы"' | hd
    Wide character in print at -e line 1.
    00000000  d1 8b                                             |..|
    00000002
    
    Similar situation, but "ы" cannot be represented in latin-1, so we get a warning and UTF-8 bytes instead.
    $ perl -w -E'print "ê"' | hd
    00000000  c3 aa                                             |..|
    00000002

    (My terminal is UTF-8. No decoding or encoding is done in this case, Perl operates on bytes.)

      No. Perl never uses latin-1.

      In the first case (print "\xEA";), Perl is expecting bytes, and you provided a string of bytes, so it printed the bytes (as-is). It didn't warn because you provided what was expected.

      In the second case (print "\x{44B}";), Perl is expecting bytes, and you didn't provide a string of bytes, so it guesses that you meant to encode the string using UTF-8, does so, and warns.

      In the third case (print "\xC3\xAA";), Perl is expecting bytes, and you provided a string of bytes, so it printed the bytes (as-is). It didn't warn because you provided what was expected.

      (A string of bytes is a string consisting entirely of characters with a value less than 256.)
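
      The three cases above can be sketched as follows. This is only an illustration: `is_byte_string` and `print_raw` are hypothetical helpers, the output goes to an in-memory handle with no encoding layer instead of STDOUT, and it assumes no `-C` switch, `PERL_UNICODE` setting, or `open` pragma is adding a default layer.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: true if every character has a value below 256.
      sub is_byte_string {
          my ($s) = @_;
          return $s !~ /[^\x00-\xFF]/;
      }

      # Hypothetical helper: print $s to an in-memory handle that has no
      # encoding layer; return the bytes written (as hex) and the number
      # of warnings raised while printing.
      sub print_raw {
          my ($s) = @_;
          my @warnings;
          local $SIG{__WARN__} = sub { push @warnings, @_ };
          my $buf = '';
          open my $fh, '>', \$buf or die "open: $!";
          print {$fh} $s;
          close $fh;
          return (unpack('H*', $buf), scalar @warnings);
      }

      for my $s ("\xEA", "\x{44B}", "\xC3\xAA") {
          my ($hex, $warned) = print_raw($s);
          printf "byte string: %-3s  written: %-4s  warned: %s\n",
              (is_byte_string($s) ? 'yes' : 'no'), $hex, ($warned ? 'yes' : 'no');
      }
      ```

      Only the second string contains a character above 255, so it should be the only one that triggers "Wide character in print" and gets written as UTF-8; the other two should pass through unchanged.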

        I think I understand it now: decoding "\xC3\xAA" from UTF-8 creates a code point with a value less than 256, U+00EA, and "\xEA" just happens to be latin-1 for the same code point. That follows from the way Unicode was designed; it is not a Perl quirk.
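
        A minimal sketch of that observation, assuming the core Encode module; the variable names are illustrative only.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode encode);

        # Decode the two UTF-8 bytes C3 AA into a single character.
        my $chars = decode('UTF-8', "\xC3\xAA");
        printf "decoded: %d character, U+%04X\n", length($chars), ord($chars);

        # Encode the same character as latin-1 (ISO-8859-1): one byte,
        # whose value equals the code point.
        my $latin1 = encode('ISO-8859-1', $chars);
        printf "latin-1: %s\n", unpack('H*', $latin1);
        ```

        The first 256 Unicode code points coincide with ISO-8859-1, which is why the code point value and the latin-1 byte value match here.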

        Thank you for correcting me.