in reply to Re^2: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
in thread Windows-1252 characters from \x{0080} thru \x{009f}

By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1)

I don't know of a single place where Perl assumes iso-8859-1.

There are many places where Perl requires strings of Unicode code points. (In the above program, those would be the match operator and the encoder.) Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point, not an iso-8859-1 character.

This makes it *look* like Perl defaults to iso-8859-1, but there is no "default", since there is only ever one thing those functions can accept. And because there is no default, it cannot be changed to cp1252 or anything else.
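A short sketch of the distinction. The euro sign is a convenient test case, since Windows-1252 places it at byte 0x80 while U+0080 is a control character; the fix is an explicit decode, not a "default" anywhere in Perl:

```perl
use strict;
use warnings;
use Encode qw(decode);

# A single byte 0x80: in Windows-1252 this is the euro sign, but Perl
# takes the unqualified byte to be the code point U+0080.
my $byte_string = "\x80";

# Matching against \x{20AC} (EURO SIGN) fails: the string holds U+0080.
print $byte_string =~ /\x{20AC}/ ? "matches\n" : "no match\n";   # no match

# Decoding the byte string as cp1252 produces the intended code point.
my $text = decode('cp1252', $byte_string);
print $text =~ /\x{20AC}/ ? "matches\n" : "no match\n";          # matches
```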


Replies are listed 'Best First'.
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
by moritz (Cardinal) on Apr 19, 2012 at 07:19 UTC
    Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character.

    The act of interpreting a byte as a Unicode codepoint is exactly equivalent to decoding it as Latin-1, which is why people say "Perl assumes ISO-8859-1", and that isn't wrong.
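    That equivalence is easy to check; a sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes   = "\xE9\xA0";                     # arbitrary high bytes
my $decoded = decode('ISO-8859-1', $bytes);   # explicit Latin-1 decode

# Latin-1 maps byte N to code point U+00N, so the ordinals are identical
# and the two strings compare equal.
print join(',', map { ord } split //, $bytes),   "\n";   # 233,160
print join(',', map { ord } split //, $decoded), "\n";   # 233,160
print $bytes eq $decoded ? "equal\n" : "different\n";    # equal
```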

    Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.

    Such a change is possible, though not as easy as it sounds. It would require Perl to keep track of what is a byte and what is a codepoint, which would be a major departure from the current model (but inevitable in the long run, IMHO).

      Yes, it is equivalent, but that doesn't make iso-8859-1 a default. A default implies a choice, something that can be changed. This is a side effect of a bug in the user's code, not a default.

      It would require Perl to keep track of what is a byte and what is a codepoint

      Even if you added a new type of data, I don't see how that helps. How can "É" match a byte? (Upd: Well, I suppose you could add a pragma to specify the encoding to use when Perl needs text from bytes, but wouldn't that break @- and pos? How would /g work? What about captures? They currently capture only from the supplied string, but that would have to change. Unless you're suggesting that the data in the scalar actually changes when the decoding happens? Yeah, I've been working on this.)

      (And it should probably be "byte, decoded text or unknown", if only for backwards compatibility.)

        You are right, I didn't consider how indexing works into a buffer that contains multi-byte characters. There is an ugly solution for that: a new type of scalar that stores two numbers, one for the byte index and one for the codepoint index. But let's not go there.

        Now I'm even more at a loss as to how to make p5's Unicode handling more robust. Maybe a three-way flag (byte/codepoint/unknown) could be introduced, and operations on incompatible types could then at least warn (probably with a warning not enabled by default), but not coerce. It would also provide at least some measure of introspection.

Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
by Jim (Curate) on Apr 19, 2012 at 17:09 UTC

    From perlunicode

    "use encoding" needed to upgrade non-Latin-1 byte strings
    By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happen to agree with Latin-1.
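    The implicit upgrade the passage describes can be observed when a byte string and a Unicode string meet in one expression; a small sketch:

```perl
use strict;
use warnings;

my $bytes = "\xE9";       # byte string (downgraded): single byte 0xE9
my $uni   = "\x{263A}";   # Unicode string (upgraded): WHITE SMILING FACE

# Concatenation forces an implicit upgrade of $bytes: byte 0xE9 is
# taken to be U+00E9 (the Latin-1 interpretation), never cp1252.
my $joined = $bytes . $uni;
printf "U+%04X U+%04X\n", map { ord } split //, $joined;   # U+00E9 U+263A
```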
      Both the quoted passage and I said that any tie to Latin-1 is merely a side effect. It's not something configurable.