Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)

Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character.

The act of interpreting a byte as a Unicode code point is exactly equivalent to decoding it as Latin-1, which is why people say "Perl assumes ISO-8859-1", and that isn't wrong.
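A minimal sketch of that equivalence, assuming the core `Encode` module is available. The variable names are illustrative only. It checks that decoding every byte value as ISO-8859-1 gives back the same code points, whereas cp1252 maps byte 0x80 somewhere else:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One string holding every possible byte value, 0x00 through 0xFF.
my $bytes = join '', map { chr } 0 .. 255;

# Explicit Latin-1 decode (of a copy, so $bytes itself is left alone).
my $latin1 = decode('ISO-8859-1', my $copy = $bytes);

# Windows-1252 disagrees with Latin-1 in the 0x80-0x9F range.
my $cp1252 = decode('cp1252', "\x80");

printf "Latin-1 decode is a no-op: %s\n", $latin1 eq $bytes ? 'yes' : 'no';  # yes
printf "cp1252 0x80 -> U+%04X\n", ord $cp1252;                               # U+20AC
```

So a string built by assigning each byte to a character already is what an ISO-8859-1 decode would have produced, and getting cp1252 semantics takes an explicit `decode`.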

Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.

Such a change is possible, though not as easy as it sounds. It would require Perl to keep track of what is a byte and what is a codepoint, which would be a major departure from the current model (but inevitable in the long run, IMHO).

Perl 6 - the future is here, just unevenly distributed

Re^5: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by ikegami (Patriarch) on Apr 19, 2012 at 16:56 UTC
Yes, it is equivalent, but that doesn't make iso-8859-1 a default. Default indicates a choice, something that can be changed. This is a side-effect of a bug in the user's code, not a default.

"It would require Perl to keep track of what is a byte and what is a codepoint"

Even if you added a new type of data, I don't see how that helps. How can "É" match a byte?

(Upd: Well, I suppose you could add a pragma to specify the encoding to use when Perl needs text from bytes, but wouldn't that break `@-` and `pos`? So how would `/g` work? What about captures? They currently only capture from the supplied string, but that would have to be changed. Unless you're suggesting that the data in the scalar actually changes when the decoding happens? Yeah, I've been working on this.)

(And it should probably be "byte, decoded text or unknown", if only for backwards compatibility.)
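To make the `@-` and `pos` concern concrete, here is a small sketch (assuming the core `Encode` module; the sample string is made up). The same match reports different offsets depending on whether the scalar holds decoded text or its UTF-8 bytes, so an implicit decode at match time would leave these offsets pointing into a string the caller never saw:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "\x{C9}t\x{E9}";          # "Été" as three code points
my $bytes = encode('UTF-8', $text);   # the same text as five bytes

# Match the "t" in the decoded string; offsets count characters.
$text =~ /t/g;
my ($text_start, $text_pos) = ($-[0], pos($text));

# Match the "t" in the encoded string; offsets count bytes.
$bytes =~ /t/g;
my ($byte_start, $byte_pos) = ($-[0], pos($bytes));

print "text:  start=$text_start pos=$text_pos\n";   # text:  start=1 pos=2
print "bytes: start=$byte_start pos=$byte_pos\n";   # bytes: start=2 pos=3
```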
Re^6: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by moritz (Cardinal) on Apr 19, 2012 at 19:35 UTC
You are right, I didn't consider how indexing works in a buffer that contains multi-byte characters. There is an ugly solution for that, which would be a new type of scalar that stores two numbers, one for the byte index and one for the codepoint index. But let's not go there.

Now I'm even more at a loss on how to make p5's Unicode handling more robust. Maybe a three-way flag (byte/codepoint/unknown) could be introduced, and operations on incompatible types could then at least warn (probably with a warning not enabled by default), but not coerce. And it would provide at least some measure of introspection.

Perl 6 - the future is here, just unevenly distributed
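A rough sketch of how such a three-way flag might behave, done as a toy wrapper class rather than inside the interpreter. Everything here is hypothetical: `TaggedStr`, its `concat` method and the kind names `bytes`/`text`/`unknown` are invented for illustration and do not exist in Perl. For simplicity the warning is always on, where the proposal would have it off by default:

```perl
use strict;
use warnings;

# Hypothetical wrapper: a string tagged as 'bytes', 'text' or 'unknown'.
package TaggedStr;

sub new {
    my ($class, $kind, $value) = @_;
    return bless { kind => $kind, value => $value }, $class;
}

sub kind  { $_[0]{kind} }
sub value { $_[0]{value} }

# Concatenate without coercing. Mixing 'bytes' and 'text' warns and
# yields 'unknown'; 'unknown' combined with a known kind stays silent.
sub concat {
    my ($self, $other) = @_;
    my ($k1, $k2) = ($self->kind, $other->kind);
    my $kind;
    if    ($k1 eq $k2)       { $kind = $k1 }
    elsif ($k1 eq 'unknown') { $kind = $k2 }
    elsif ($k2 eq 'unknown') { $kind = $k1 }
    else {
        warn "concatenating $k1 with $k2\n";
        $kind = 'unknown';
    }
    return TaggedStr->new($kind, $self->value . $other->value);
}

package main;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $octets = TaggedStr->new(bytes   => "\xC3\x89");
my $chars  = TaggedStr->new(text    => "\x{C9}");
my $legacy = TaggedStr->new(unknown => "abc");

my $ok    = $octets->concat($legacy);   # bytes + unknown: silent
my $mixed = $octets->concat($chars);    # bytes + text: warns, no coercion

printf "%s / %s / %d warning(s)\n", $ok->kind, $mixed->kind, scalar @warnings;
```

Letting `unknown` adopt the other operand's kind is just one possible choice for backwards compatibility. Keeping the result `unknown` would be the more conservative alternative.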
Re^7: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by BrowserUk (Patriarch) on Apr 19, 2012 at 22:33 UTC
"Now I'm even more at a loss on how to make p5's Unicode handling more robust."

There is an efficient, workable solution to this problem.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The start of some sanity?