Re^8: Database vs XML output representation of two-byte UTF-8 character

RTFM then.

huh? I asked what you meant.

LOL. It looks like Latin-1 and quacks like Latin-1, but it's not Latin-1. Yeah, it's just 'byte-packed subset of Unicode'.

huh? What are you talking about?

How about you 'fix' Perl's documentation, and then start arguing... It even talks about 'Unicode' and 'binary' strings (gasp).

You can start here. "Perl will assume that your binary string was encoded with ISO-8859-1" is indeed completely wrong. Concatenation doesn't know or care what the string is.

$ perl -MEncode -E'
   $x = chr(0x2660);
   $y = encode($ARGV[0], chr(0xC9));
   say sprintf "%vX.%vX %vX", $x, $y, $x.$y;
' iso-latin-1
2660.C9 2660.C9

$ perl -MEncode -E'
   $x = chr(0x2660);
   $y = encode($ARGV[0], chr(0xC9));
   say sprintf "%vX.%vX %vX", $x, $y, $x.$y;
' UTF-8
2660.C3.89 2660.C3.89
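
The same point as a short script, for anyone who prefers that to one-liners. This is only a sketch: the variable names are mine, and it assumes nothing beyond core Encode and the built-in utf8:: functions. It shows that concatenation carries over every character of both operands unchanged, and that changing the internal storage format of a string does not change its value.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $wide  = chr(0x2660);                   # one character, U+2660
my $bytes = encode('UTF-8', chr(0xC9));    # two characters: 0xC3 and 0x89

# Concatenation keeps the characters of both operands as they are.
my $joined = $wide . $bytes;
my @ords   = map { sprintf('%X', ord($_)) } split //, $joined;
print "@ords\n";                           # 2660 C3 89

# Switching the internal storage format leaves the string's value alone.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
print $upgraded eq $bytes ? "same string\n" : "different string\n";
```

Swapping 'UTF-8' for 'iso-latin-1' in the encode call should give 2660 C9 instead, matching the first one-liner above.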

Is that an error that perl -wE 'my $x = chr(0x00A9); say $x' does one thing, and perl -wE 'my $y = chr(0x2660); say $y' does something else?

No, Perl "doing something else" (telling you you made an error) when you provide a bad input is not an error.

chr should be consistent,

huh? chr always returns a string consisting of the specified character.
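
A small sketch to make that concrete (variable names are mine, nothing here beyond core Perl). chr hands back a one-character string for either argument. The difference in the two quoted one-liners only shows up at output time, when a character above 0xFF is printed to a handle that has no encoding layer. Adding an encoding layer to STDOUT is one way to avoid that, offered here as an illustration rather than as the only fix.

```perl
use strict;
use warnings;

my $x = chr(0x00A9);
my $y = chr(0x2660);

# chr behaves the same for both: one character with the requested code point.
printf "%d %X\n", length($x), ord($x);    # 1 A9
printf "%d %X\n", length($y), ord($y);    # 1 2660

# Encode on output so the handle receives bytes, not a wide character.
binmode(STDOUT, ':encoding(UTF-8)');
print "$x $y\n";
```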

So you're not even disagreeing. You just hate the word 'Latin-1'

huh? What are you talking about?!? No, I hate that you're saying your errors are errors in Perl. I hate that you are spreading misinformation about how Perl works. I hate that you're confusing people with issues that aren't even related to theirs. The OP's problem had nothing to do with internal storage formats.


Replies are listed 'Best First'.

Re^9: Database vs XML output representation of two-byte UTF-8 character (koolaid)
by tye (Sage) on Sep 09, 2014 at 16:42 UTC

What they wrote wasn't hard for me to understand. I think being overly submerged in the "the unicode bug" mindset koolaid is what prevents you from understanding it. You seem unable even to realize that the author was quoting Perl's own documentation, which starkly disagrees with your narrow way of viewing this.

It is sad that a reasonable heuristic (if somebody concats a UTF-8 string with a non-UTF-8 string, a reasonable approach would be to assume Latin-1 and give a UTF-8 result) chosen for Perl long ago, has been elevated to some bizarre religion dedicated to maintaining with airtight absoluteness the fiction that "it doesn't matter how the string is encoded". And it has come to the point that one can't even try to increase clarity by describing actual facts about how things are encoded without being contradicted by cult members claiming that one is completely wrong.

Yes, one can choose to view Perl's handling of strings and Unicode in the "the unicode bug" way where how a string is actually encoded/stored shouldn't matter (and quite often doesn't matter in the end). And that can even be a useful approach. But that is not the only valid way to think about this stuff.

Worse, demanding that people not even consider how a string is actually stored just leaves a huge opportunity for confusion. To be successful in using the "the unicode bug" mindset, many people first have to obtain an understanding of how Perl proposes to make the encoding not matter. So, for many people, you have to first explain the details about the encoding of Perl strings and how it gets changed and why that is often a reasonable approach before they can accept the "the encoding doesn't matter" premise and start making sound decisions based upon it.

So, for people not already steeped in the "the unicode bug" koolaid, it is best, in my experience, to start with "Perl has byte strings and UTF-8 strings and when they cross paths, the byte string is assumed to be Latin-1 and is upgraded to UTF-8". After that, you can explain that it isn't really UTF-8 but Perl's own extension to UTF-8 (called "utf8" or so) and that the assumption isn't strictly "Latin-1" (though the distinctions on that second point are too subtle for me to discern with any clarity). But those clarifications mostly just don't matter except to pedants. And then you can explain about how the encoding shouldn't matter and that you are meant to decode all inputs and encode all outputs, etc.
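
A rough sketch of the two ideas in that paragraph, assuming only core Encode. The sample byte strings are made up for illustration. The first half checks that a byte string which crosses paths with a wide string ends up with the same characters it would have if decoded as ISO-8859-1. The second half shows the decode-inputs, encode-outputs pattern.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC9t\xE9";                  # hypothetical input: three bytes

# Crossing paths with a wide string: each byte becomes the code point
# with the same number, as if the bytes had been decoded from ISO-8859-1.
my $mixed = chr(0x2660) . $bytes;
my $same  = substr($mixed, 1) eq decode('ISO-8859-1', $bytes);
print $same ? "upgrade matches Latin-1 decode\n" : "mismatch\n";

# Decode inputs, work with characters, encode outputs.
my $input  = "\xC3\x89t\xC3\xA9";         # the same three letters as UTF-8 bytes
my $text   = decode('UTF-8', $input);     # 3 characters
my $output = encode('UTF-8', $text . chr(0x2660));
printf "%d characters in, %d bytes out\n", length($text), length($output);    # 3 in, 8 out
```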

The worst part about the "the unicode bug" koolaid is that it completely blocks even discussing (much less actually considering) real improvements to Perl's string/Unicode handling. It is completely useless to propose that "assume Latin-1" should actually be "assume Windows-1252" or "assume current locale" because such concepts appear to simply not even make sense to many core maintainers of Perl now.

- tye
