Re^4: german Alphabet

Perl assumes its source code is ASCII (unless you `use utf8;`), not latin-1.

$ perl -Mutf8 -MEncode -e'print encode("latin-1", "sub fête {}\n");' \
   | perl
Illegal declaration of subroutine main::f at - line 1.

If you happen to use an 8-bit byte in a string literal, Perl creates a character with the value of that byte rather than throwing an error.


Re^5: german Alphabet by Anonymous Monk on Dec 15, 2018 at 19:51 UTC
It might be important to note that when one tries to print a wide string that happens to be representable in latin-1, Perl uses latin-1 with no warnings:

$ perl -w -Mutf8 -E'print "ê"' | hd
00000000  ea  |.|
00000001

`"ê"` is decoded into characters but then printed to a handle that doesn't have an `:encoding(...)` or `:utf8` I/O layer. Since it's representable in latin-1, the single-byte encoding is used and no warning is shown.

$ perl -w -Mutf8 -E'print "ы"' | hd
Wide character in print at -e line 1.
00000000  d1 8b  |..|
00000002

Similar situation, but `"ы"` cannot be represented in latin-1, so we get a warning and UTF-8 bytes instead.

$ perl -w -E'print "ê"' | hd
00000000  c3 aa  |..|
00000002

(My terminal is UTF-8. No decoding or encoding is done in this case; Perl operates on bytes.)
Re^6: german Alphabet by ikegami (Patriarch) on Dec 16, 2018 at 19:53 UTC
No. Perl never uses latin-1.

In the first case (`print "\xEA";`), Perl is expecting bytes, and you provided a string of bytes, so it printed the bytes (as-is). It didn't warn because you provided what was expected.

In the second case (`print "\x{44B}";`), Perl is expecting bytes, and you didn't provide a string of bytes, so it guesses that you meant to encode the characters using UTF-8, does so, and warns.

In the third case (`print "\xC3\xAA";`), Perl is expecting bytes, and you provided a string of bytes, so it printed the bytes (as-is). It didn't warn because you provided what was expected.

(A string of bytes is a string consisting entirely of characters with a value less than 256.)
Re^7: german Alphabet by Anonymous Monk on Dec 16, 2018 at 21:47 UTC
I think I understand it now: decoding `"\xC3\xAA"` from UTF-8 creates a code point with a value less than 256, `U+00EA`, and `"\xEA"` just happens to be latin-1 for the same code point because of the way Unicode was designed, not because of a Perl quirk. Thank you for correcting me.