Re^4: Default encoding rules leave me puzzled...

Code points is an abstraction, it's an internal Perl thing.

What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.

It must produce a bunch of bytes.

No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.

In which of the following does print use iso-latin-1?

use utf8;
my $s1 = inet_aton('195.169.195.171');  print($s1);
my $s2 = encode_utf8("éë");             print($s2);
my $s3 = "Ã©Ã«";                        print($s3);
my $s4 = "\xC3\xA9\xC3\xAB";            print($s4);

The only two possible answers are "all of them" or "none of them", since print can't tell the difference between those strings.

If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.

That prints garbage instead of 'ç'.

Because the terminal expects bytes of UTF-8, but it got bytes of Unicode code points.
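A minimal sketch of that difference, assuming a terminal that expects UTF-8. It only compares the byte values involved; the variable names are illustrative and not from the original post.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E7}";           # one character, code point U+00E7 (ç)

# On a handle with no encoding layer, print writes the element value itself.
my $raw_bytes = join ' ', map { sprintf '%02X', ord($_) } split //, $char;

# Encoding explicitly gives the two bytes a UTF-8 terminal expects.
my $utf8_bytes = join ' ',
    map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $char);

print "printed as-is:    $raw_bytes\n";    # E7
print "encoded to UTF-8: $utf8_bytes\n";   # C3 A7
```

A lone byte E7 is not valid UTF-8 on its own, which is consistent with the garbage seen on the terminal.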

Re^5: Default encoding rules leave me puzzled... by Anonymous Monk on Jun 21, 2014 at 12:14 UTC
What are you talking about? It has nothing to do with Perl. "e" is formed from the code point U+0065, "é" is formed from code point U+00E9 or from code points U+0065 + U+0301, etc. This is defined by The Unicode Consortium, not by Perl.

And the idea that it's OK to treat OCTET 0xE7 as a substitute for code point U+00E7 is totally not defined by the consortium.

No, the input must be a string of integers in 0..255, which it is. print has no problem storing those as bytes. iso-latin-1 doesn't factor into it.

OMG. Who cares what print expects. Even Perl (in other parts) thinks that that's ridiculous.

`perl -wE 'say "ç" + "ç"'`

The operator plus expects numbers, just like print, right?

If you claim that iso-latin-1 is used, then you claim that use utf8; produces iso-latin-1. It doesn't. It produces Unicode code points.

Printing UNICODE STRINGS (and Perl CAN tell the difference between binary and unicode) on binary STDOUT produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255. The Consortium totally wouldn't approve of that. And that's it.

It appears you just don't like the word 'encoding'. Most people would still call Perl's behavior 'encoding'; that word is certainly good enough for me. You (MAYBE) would've had a point if Perl actually stored unicode codepoint U+00E7 as an octet 0xE7 internally. But we know that it doesn't anyway. Have a nice day.
Re^6: Default encoding rules leave me puzzled... by ikegami (Patriarch) on Jun 22, 2014 at 00:18 UTC
produces a sequence of octets ENCODED as Latin-1 for code points 0 - 255

It gives the same result, yes, but only by virtue of Unicode code points being rather similar to iso-latin-1, not because `print` does any encoding. `print` does this:

- If any of the elements of the string is larger than 255,
    - Warn "wide character".
    - Encode the string using utf8.
- For each element of the string,
    - Print that number as a byte.

The operator plus expects numbers, just like print, right?

Two individual numbers, yes. `print` takes strings of them. The bitwise operators accept either.

$ perl -E'say "ABC" | "   "'
abc
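The steps listed above can be sketched in Perl roughly as follows. This is an illustration of the description in this post, not Perl's actual implementation; `bytes_print_would_write` is a hypothetical helper name, and it returns the byte values rather than writing them.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl built-in): returns the byte values that
# print would write for $str on a handle with no encoding layer.
sub bytes_print_would_write {
    my ($str) = @_;                       # work on a copy of the caller's string
    if (grep { ord($_) > 255 } split //, $str) {
        warn "Wide character\n";
        utf8::encode($str);               # replace the elements with their UTF-8 bytes
    }
    return map { ord($_) } split //, $str;   # each element goes out as one byte
}

my @latin = bytes_print_would_write("\xE7");       # no element above 255
my @wide  = bytes_print_would_write("\x{20AC}");   # warns, then UTF-8 encodes

print join(' ', map { sprintf '%02X', $_ } @latin), "\n";   # E7
print join(' ', map { sprintf '%02X', $_ } @wide),  "\n";   # E2 82 AC
```

Under this model no Latin-1 table is consulted for the first string: the element value 0xE7 is simply written as the byte 0xE7.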
Re^6: Default encoding rules leave me puzzled... by Anonymous Monk on Jun 21, 2014 at 12:46 UTC
I remembered something.

`perl -MScalar::Util=looks_like_number -wE 'use utf8; say looks_like_number("ç")? "yes" : "no"'`
Re^7: Default encoding rules leave me puzzled... by Jim (Curate) on Jun 22, 2014 at 19:05 UTC
What output does that Perl command-line script produce?

C:\>chcp
Active code page: 437

C:\>perl -MScalar::Util=looks_like_number -wE "use utf8; say looks_like_number('ç')? 'yes' : 'no'"
no

C:\>bash

$ perl -MScalar::Util=looks_like_number -wE 'use utf8; say looks_like_number("ç") ? "yes" : "no"'
Malformed UTF-8 character (1 byte, need 3, after start byte 0xe7) at -e line 1.
no

$ exit

C:\>

By posting a command-line script and then not posting the output it produces, you've made no useful point, at least not one that's immediately understandable.
Re^8: Default encoding rules leave me puzzled... by ikegami (Patriarch) on Jun 25, 2014 at 17:31 UTC
Re^8: Default encoding rules leave me puzzled... by ikegami (Patriarch) on Jun 25, 2014 at 17:32 UTC
Re^7: Default encoding rules leave me puzzled... by Anonymous Monk on Jun 21, 2014 at 13:37 UTC
OMG so that's probably what Perl actually does. Converts in-place the internal representation of the string from UTF-X to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). Yeah, binary print works exactly like utf8::downgrade.
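A small sketch of what `utf8::upgrade` and `utf8::downgrade` change, for anyone who wants to check the comparison. It only inspects the internal storage flag via `utf8::is_utf8`; it does not show that `print` calls `utf8::downgrade`, which is this post's guess. The variable names are illustrative.

```perl
use strict;
use warnings;

my $original = "\xE7";             # one element with value 0xE7
my $copy     = $original;

utf8::upgrade($copy);              # internal storage becomes UTF-8
my $flag_after_upgrade = utf8::is_utf8($copy) ? 1 : 0;
my $still_equal        = ($copy eq $original) ? 1 : 0;
my $length_upgraded    = length $copy;

utf8::downgrade($copy);            # back to one byte per element
my $flag_after_downgrade = utf8::is_utf8($copy) ? 1 : 0;

print "flag after upgrade:   $flag_after_upgrade\n";    # 1
print "equal to original:    $still_equal\n";           # 1
print "length when upgraded: $length_upgraded\n";       # 1
print "flag after downgrade: $flag_after_downgrade\n";  # 0
```

The string value (its length and its elements) is the same in both storage formats; only the internal representation differs.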
Re^8: Default encoding rules leave me puzzled... by Anonymous Monk on Jun 21, 2014 at 13:48 UTC
Re^9: Default encoding rules leave me puzzled... by Jim (Curate) on Jun 22, 2014 at 19:08 UTC
Re^5: Default encoding rules leave me puzzled... by Jim (Curate) on Jun 22, 2014 at 19:22 UTC
Your Perl script doesn't compile.

C:\>chcp
Active code page: 437

C:\>type 1090732.pl
∩╗┐use utf8;
my $s1 = inet_aton('195.169.195.171'); print($s1);
my $s2 = encode_utf8("├⌐├½"); print($s2);
my $s3 = "├â┬⌐├â┬½"; print($s3);
my $s4 = "\xC3\xA9\xC3\xAB"; print($s4);

C:\>cat 1090732.pl
use utf8;
my $s1 = inet_aton('195.169.195.171'); print($s1);
my $s2 = encode_utf8("éë"); print($s2);
my $s3 = "Ã©Ã«"; print($s3);
my $s4 = "\xC3\xA9\xC3\xAB"; print($s4);

C:\>perl 1090732.pl
Undefined subroutine &main::inet_aton called at 1090732.pl line 2.

C:\>
Re^6: Default encoding rules leave me puzzled... by ikegami (Patriarch) on Jun 23, 2014 at 01:53 UTC
`inet_aton` is provided by Socket, and `encode_utf8` is provided by Encode. I left a few obvious headers out since they weren't relevant. In all four cases, `print` outputs the four bytes `C3 A9 C3 AB` because in all four cases, the string passed to `print` was `"\xC3\xA9\xC3\xAB"`.
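For anyone who wants to verify this, here is a sketch of the same four strings with the Socket and Encode imports added. It assumes the file is saved as UTF-8, which `use utf8;` requires for the non-ASCII literals. It prints the element values in hex instead of the raw bytes, so the result is readable on any terminal.

```perl
use strict;
use warnings;
use utf8;                          # source file must be saved as UTF-8
use Socket qw(inet_aton);
use Encode qw(encode_utf8);

my @strings = (
    inet_aton('195.169.195.171'),  # packed IPv4 address
    encode_utf8("éë"),             # UTF-8 encoding of two characters
    "Ã©Ã«",                        # four characters, each below 256
    "\xC3\xA9\xC3\xAB",            # the same four values, written as escapes
);

for my $s (@strings) {
    # Show each element of the string as a two-digit hex number.
    print join(' ', map { sprintf '%02X', ord($_) } split //, $s), "\n";
}
```

Each of the four lines should read `C3 A9 C3 AB`, and all four strings compare equal with `eq`.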
Re^7: Default encoding rules leave me puzzled... by Jim (Curate) on Jun 23, 2014 at 18:33 UTC
`inet_aton` is provided by Socket, and `encode_utf8` is provided by Encode. I left a few obvious headers out since they weren't relevant.

The headers can't possibly be irrelevant if the Perl script doesn't compile without them. And there's nothing intrinsically obvious about them either. If there were, then `perl` wouldn't need the programmer to `use` them. As it happens, I knew that `encode_utf8()` is from the Encode module because I'd used it before, but I didn't recognize `inet_aton()` because I'd never used the Socket module before.

If you post something on PerlMonks to make a point, you can't neglect to make the point. Otherwise, you're just arguing obscurely and unhelpfully.

In all four cases, `print` outputs the four bytes `C3 A9 C3 AB` because in all four cases, the string passed to `print` was `"\xC3\xA9\xC3\xAB"`.

This is the point you neglected to make. In your post, you didn't state the point explicitly, and you also didn't include the output of the Perl script you intended to demonstrate the point you were trying to make. You left it as an exercise for the reader to run your script, which we've established doesn't compile as posted.

PerlMonks is now littered with threads much like this one. A monk comes to the Monastery seeking clarification about how character encodings and Unicode work in Perl, particularly for help understanding their myriad subtleties. Instead of getting clarification, the monk just gets more confusing details, oftentimes within a torrent of rhetorical arguments and even flame wars. You're involved in many of these discussions, and I think your explanations usually lead to more confusion rather than to greater clarity.

I don't doubt that you're 100% correct in every gory technical detail. I just don't think you do an effective job of translating the technical facts from your complex mental model of them into clear information about the topic that ordinary Perl hackers can use to help them write Perl scripts.
Re^8: Default encoding rules leave me puzzled... by ikegami (Patriarch) on Jun 25, 2014 at 17:12 UTC