in reply to Re^2: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
in thread Windows-1252 characters from \x{0080} thru \x{009f}

By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1)

I don't know of a single place where Perl assumes iso-8859-1.

There are many places where Perl requires strings of Unicode code points. (In the above program, those would be the match operator and the encoder.) Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point, not an iso-8859-1 character.

This makes it *look* like Perl defaults to iso-8859-1, but there is no "default", since there is only ever one thing those functions can accept. And because there is no default, it cannot be changed to cp1252 or anything else.
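A short sketch of the distinction. The euro sign is a convenient test case, since Windows-1252 places it at byte 0x80 while U+0080 is a control character; the fix is an explicit decode, not a "default" anywhere in Perl:

```perl
use strict;
use warnings;
use Encode qw(decode);

# A single byte 0x80: in Windows-1252 this is the euro sign, but Perl
# takes the unqualified byte to be the code point U+0080.
my $byte_string = "\x80";

# Matching against \x{20AC} (EURO SIGN) fails: the string holds U+0080.
print $byte_string =~ /\x{20AC}/ ? "matches\n" : "no match\n";   # no match

# Decoding the byte string as cp1252 produces the intended code point.
my $text = decode('cp1252', $byte_string);
print $text =~ /\x{20AC}/ ? "matches\n" : "no match\n";          # matches
```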


Replies are listed 'Best First'.
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
by moritz (Cardinal) on Apr 19, 2012 at 07:19 UTC
    Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character.

    The act of interpreting a byte as a Unicode codepoint is exactly equivalent to decoding it as Latin-1, which is why people say "Perl assumes ISO-8859-1", and that isn't wrong.
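    That equivalence is easy to check; a sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes   = "\xE9\xA0";                     # arbitrary high bytes
my $decoded = decode('ISO-8859-1', $bytes);   # explicit Latin-1 decode

# Latin-1 maps byte N to code point U+00N, so the ordinals are identical
# and the two strings compare equal.
print join(',', map { ord } split //, $bytes),   "\n";   # 233,160
print join(',', map { ord } split //, $decoded), "\n";   # 233,160
print $bytes eq $decoded ? "equal\n" : "different\n";    # equal
```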

    Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.

    Such a change is possible, though not as easy as it sounds. It would require Perl to keep track of what is a byte and what is a codepoint, which would be a major departure from the current model (but inevitable in the long run, IMHO).

      Yes, it is equivalent, but that doesn't make iso-8859-1 a default. A default implies a choice, something that can be changed. This is a side effect of a bug in the user's code, not a default.

      It would require Perl to keep track of what is a byte and what is a codepoint

      Even if you added a new type of data, I don't see how that helps. How can "É" match a byte? (Upd: Well, I suppose you could add a pragma to specify the encoding to use when Perl needs text from bytes, but wouldn't that break @- and pos? How would /g work? What about captures? They currently capture only from the supplied string, but that would have to change. Unless you're suggesting that the data in the scalar actually changes when the decoding happens? Yeah, I've been working on this.)

      (And it should probably be "byte, decoded text or unknown", if only for backwards compatibility.)

        You are right, I didn't consider how indexing works into a buffer that contains multi-byte characters. There is an ugly solution for that: a new type of scalar that stores two numbers, one for the byte index and one for the codepoint index. But let's not go there.

        Now I'm even more at a loss as to how to make p5's Unicode handling more robust. Maybe a three-way flag (byte/codepoint/unknown) could be introduced, and operations on incompatible types could then at least warn (probably with a warning not enabled by default), but not coerce. It would also provide at least some measure of introspection.

Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
by Jim (Curate) on Apr 19, 2012 at 17:09 UTC

    From perlunicode

    "use encoding" needed to upgrade non-Latin-1 byte strings
    By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happen to agree with Latin-1.
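    The implicit upgrade the passage describes can be observed when a byte string and a Unicode string meet in one expression; a small sketch:

```perl
use strict;
use warnings;

my $bytes = "\xE9";       # byte string (downgraded): single byte 0xE9
my $uni   = "\x{263A}";   # Unicode string (upgraded): WHITE SMILING FACE

# Concatenation forces an implicit upgrade of $bytes: byte 0xE9 is
# taken to be U+00E9 (the Latin-1 interpretation), never cp1252.
my $joined = $bytes . $uni;
printf "U+%04X U+%04X\n", map { ord } split //, $joined;   # U+00E9 U+263A
```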
      Both the quoted passage and I said that any tie to Latin-1 is merely a side effect. It's not something configurable.