in reply to Windows-1252 characters from \x{0080} thru \x{009f}

No, "\x{009a}" (a Unicode character) does not map to cp1252.

You did not tell Perl a specific encoding to use for your source code. So Perl assumed that your source code was encoded in Latin-1. Your examples show that you treated your source code as encoded in Windows-1252. So it isn't particularly surprising that Perl and you disagree about some of the characters in your source code (hard-coded into string literals).

So, for example, byte \x9a looks like an accented character when interpreted as Windows-1252 (something that this website also does -- check the headers). It looks just like (is the same character as) the Unicode character "\x{0161}" (š).

But Perl assumes that byte \x9a is in Latin-1 and so treats it the same as the Unicode character "\x{009a}" (a control character, 'single character introducer', that shouldn't be visible if I tried to reproduce it here), which is a character not available in Windows-1252.

So Perl tells you that it can't convert that character to Windows-1252.
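A minimal sketch of that disagreement, using the core Encode module. The helper name describe_byte is made up for illustration; the byte and code points are the ones discussed above.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: report which code point one byte becomes
# under each of the two encodings.
sub describe_byte {
    my ($byte) = @_;
    my $as_latin1 = decode('iso-8859-1', $byte);
    my $as_cp1252 = decode('cp1252',     $byte);
    return sprintf 'Latin-1: U+%04X, Windows-1252: U+%04X',
        ord $as_latin1, ord $as_cp1252;
}

print describe_byte("\x9a"), "\n";   # Latin-1: U+009A, Windows-1252: U+0161

# U+0161 exists in Windows-1252, so it encodes back to byte 9A.
printf "U+0161 encodes to byte %02X\n", ord encode('cp1252', "\x{0161}");

# U+009A does not. With FB_CROAK the encoder dies instead of
# substituting a replacement and warning.
my $control = "\x{009a}";
my $ok = eval { encode('cp1252', $control, Encode::FB_CROAK); 1 };
print 'U+009A ', ($ok ? 'encodes' : 'does not map to cp1252'), "\n";
```

The same byte yields two different characters depending on which encoding is assumed, and only one of them can be written back out as Windows-1252.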

Now, it has become very common for things claiming to be Latin-1 to actually include bytes from Windows-1252, with the desire and expectation that they be interpreted as Windows-1252, not as Latin-1. It is so common that the W3C even decided that web pages claiming to be Latin-1 should just be treated as if they had claimed to be Windows-1252.

And it looks like that decision may have confused, for example, http://www.fileformat.info/info/unicode/char/009a/index.htm, which (for me, anyway) shows a nice hatted 's' despite claiming it is an "Other, Control" type of character (compare to http://www.fileformat.info/info/unicode/char/0161/index.htm).

[ Note that the W3C declaring "treat Latin-1 as Windows-1252" for web pages does not change the definition of either of those character sets, nor does it have any impact on how Encode converts between them or on how Perl treats script source code (not downloaded from a web page). ]

- tye        

Re^2: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)
by Jim (Curate) on Apr 19, 2012 at 04:47 UTC

    Thank you for your thorough explanation, tye. You answered my question.

    The W3C is doing the right thing. (See 8.2.2.2 Character encodings in the HTML5 working draft specification.) Its willful violation of anachronistic standards for compelling, practical reasons is, IMHO, a practice that is overdue in Perl 5. By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1). Its failure to do this is one of the little things that make Perl 5 seem old and crufty, especially to Windows programmers. By dogmatically adhering to some misguided commitment to compatibility and portability, Perl 5 violates the principle of least astonishment.

    By the way, I had done something like this…

        C:\>chcp
        Active code page: 1252

        C:\>type match_test_3.pl
        #!perl

        use strict;
        use warnings;
        use open qw( :encoding(Windows-1252) :std );

        my $pattern = qr/\A\w+\z/;

        for my $word (@ARGV) {
            my $result = $word =~ $pattern ? "matches" : "doesn't match";

            printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
        }

        C:\>perl match_test_3.pl Tšekissä Žena Œdipus Rex
        "\x{009a}" does not map to cp1252 at match_test_3.pl line 12.
        The word "T\x{009a}ekissä" doesn't match the pattern (?^:\A\w+\z)
        "\x{008e}" does not map to cp1252 at match_test_3.pl line 12.
        The word "\x{008e}ena" doesn't match the pattern (?^:\A\w+\z)
        "\x{008c}" does not map to cp1252 at match_test_3.pl line 12.
        The word "\x{008c}dipus" doesn't match the pattern (?^:\A\w+\z)
        The word "Rex" matches the pattern (?^:\A\w+\z)

        C:\>

    …before I posted my inquiry here to prove to myself that the problem wasn't just with the use within the Perl source file of Windows-1252 characters in the range from 80 thru 9F.
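A note on the session above: `use open ... :std` sets I/O layers on file handles (including STDIN, STDOUT and STDERR), but it does not decode @ARGV, so the command-line words arrive as raw bytes. Below is a sketch of one possible fix, assuming the console really does hand Perl Windows-1252 bytes, as the chcp output suggests. The helper name decode_args is hypothetical, and the arguments are simulated with explicit bytes so the example does not depend on a shell.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: turn raw command-line bytes into character strings.
# Assumes the bytes are Windows-1252 (code page 1252).
sub decode_args {
    return map { decode('cp1252', $_) } @_;
}

my $pattern = qr/\A\w+\z/;

# Stand-in for @ARGV after `perl script.pl Tšekissä Žena Œdipus Rex`.
my @raw = ("T\x9aekiss\xe4", "\x8eena", "\x8cdipus", "Rex");

for my $word (decode_args(@raw)) {
    # Print only ASCII here so the sketch needs no output layer.
    printf "%d characters, first is U+%04X: %s\n",
        length $word, ord $word,
        $word =~ $pattern ? "matches" : "doesn't match";
}
```

In the real script, the loop would run over `decode_args(@ARGV)` and the existing `use open` line would take care of encoding the output.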

    There's a Feedback button at the bottom of the page http://www.fileformat.info/info/unicode/char/009a/index.htm. ☺

    Thanks again.

    Jim

      Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1).

      I really hope this will NEVER happen, not even on a Windows platform. cp1252 is only the "default" on Windows; it is not the default on any other platform, and changing perl5's default to cp1252 would break every script that assumes the current default (wise or not).

      Most perl scripts are cross-platform portable, at least they can be when the programmer follows the basic porting rules. Most of my scripts and modules are cross platform, and I do test my modules on HP-UX, Linux, AIX and Windows (and sometimes even on OSX when I can access such architecture).

      That said, the default IMHO is likely to change for Windows. If not in Windows 8 (or whatever they will call it) then maybe Windows 9 or 10 will have Unicode as default character set. Problem solved. I already use UTF-8 as default encoding on all my browsers (Opera, Firefox, Konqueror, Opera Mobile) and IRC.

      My advice to you would be to switch to using UTF-8 (and, when you do, to declare 'use utf8;' next to 'use strict;' and 'use warnings;' at the head of your scripts).
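A sketch of what that advice can look like in practice, assuming the script file itself is saved as UTF-8. The words are the ones from the example earlier in the thread; the is_word helper and the choice of a UTF-8 output layer are illustrative assumptions, not something prescribed above.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # encode characters on the way out

# Hypothetical helper: true if the string consists only of word characters.
sub is_word { return $_[0] =~ /\A\w+\z/ ? 1 : 0 }

my @words = ('Tšekissä', 'Žena', 'Œdipus', 'Rex');

for my $word (@words) {
    printf "%s (%d characters): %s\n", $word, length $word,
        is_word($word) ? 'matches' : "doesn't match";
}
```

With `use utf8` in effect the literals are character strings rather than byte strings, so `length` counts characters and `\w` sees š, Ž and Œ as letters.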


      Enjoy, Have FUN! H.Merijn

        Every one of your browsers—Opera, Firefox, Konqueror and Opera Mobile—will default to the Windows-1252 character encoding if wrongly told by the creator of the HTML document that the text of the document is in the Latin‑1 (ISO 8859‑1) character encoding. It was exactly this behavior of these popular web browsers that influenced the W3C to standardize this practice in its specification of HTML5—a willful violation of existing standards.

        How many existing cross-platform Perl scripts treat characters in the range from 80 thru 9F as ISO 8859‑1 control codes? Maybe lots of them do. I don't know.

        Jim

        Most of my scripts and modules are cross platform

        Unicode::Tussle doesn't appear to be runnable on win32 :)

      By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1)

      I don't know of a single place where Perl assumes iso-8859-1.

      There are many places where Perl requires strings of Unicode code points. (In the above program, those would be the match operator and the encoder.) Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character.

      This makes it *look* like Perl defaults to iso-8859-1, but there is no "default" since there is only ever one thing those functions can accept. Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.

        Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character.

        The act of interpreting a byte as a Unicode codepoint is exactly equivalent to decoding it as Latin-1. Which is why people say "Perl assumes ISO-8859-1", and that isn't wrong.
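That equivalence can be checked directly. A small sketch using the core Encode module; the helper name mismatches is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: count the byte values for which "treat the byte
# as a code point" and "decode the byte with $encoding" disagree.
sub mismatches {
    my ($encoding) = @_;
    my $count = 0;
    for my $n (0 .. 255) {
        my $byte = chr $n;                       # a one-byte string
        $count++ if decode($encoding, $byte) ne $byte;
    }
    return $count;
}

print 'iso-8859-1 mismatches: ', mismatches('iso-8859-1'), "\n";   # 0
print 'cp1252 mismatches:     ', mismatches('cp1252'), "\n";       # more than 0
```

Latin-1 is the only one of the two for which every byte value equals its own code point, which is why the two descriptions of Perl's behavior end up meaning the same thing.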

        Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.

        Such a change is possible, though not as easy as it sounds. It would require Perl to keep track of what is a byte and what is a codepoint, which would be a major departure from the current model (but inevitable in the long run, IMHO).

        From perlunicode

        "use encoding" needed to upgrade non-Latin-1 byte strings
        By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.
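The implicit upgrade described in that passage can be observed with a byte string and a wide character. A minimal sketch; code_points is a hypothetical helper, and utf8::is_utf8 is used only to peek at the internal representation.

```perl
use strict;
use warnings;

# Hypothetical helper: list a string's code points as U+XXXX.
sub code_points {
    return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0];
}

my $bytes = "\x9a";             # byte string: one byte, UTF8 flag off
my $wide  = "\x{263A}";         # a character that needs the upgraded form

# Concatenation forces the byte into the upgraded form. It is not
# reinterpreted as Windows-1252 or anything else: 9A simply becomes U+009A.
my $joined = $bytes . $wide;

print code_points($joined), "\n";                             # U+009A U+263A
print utf8::is_utf8($bytes)  ? 'flagged' : 'bytes', "\n";     # bytes
print utf8::is_utf8($joined) ? 'flagged' : 'bytes', "\n";     # flagged
```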

      Perl 5 should also be defaulting to Windows-1252 … little things that make Perl 5 seem old and crufty … Perl 5 violates the principle of least astonishment.

      So, a highly limited Latin only encoding seems modern/uncrufty to you in 2012? There are many encodings and it’s pretty easy with newer perls to use whatever you like or to default to the entirely reasonable utf-8. And not to put too fine a point on it—as the kids used to say and with full knowledge that two of the very best hackers on PM are WinCats—but I’ve never ceased to be astonished that anyone used Windows ever.

        So, a highly limited Latin only encoding seems modern/uncrufty to you in 2012?

        The Windows‑1252 character set isn't "highly limited." I'm an English-speaking monoglot—or, more to the point, an English-writing monoglot—so I can use the Windows‑1252 character set for all my writing. And I can likely continue to use it for a very long time, either until I learn another language that uses a writing system other than Latin, or until I drop dead. Saying Windows‑1252 is highly limited because it can't be used to write Chinese or Hebrew is like saying my Toyota Corolla is highly limited because it can't fly in the sky or sail the seas.

        The Windows‑1252 and ISO 8859‑1 (Latin 1) character sets are still very commonly used today for digital text. For example, in my industry, e-discovery and litigation support in the United States, text and data are much more often Windows‑1252 than Unicode (UTF‑8). This is just how it is.

        So, no, Windows‑1252 and Latin 1 don't seem especially unmodern or crufty to me. They're just older, single-byte encodings, not Unicode, that's all.

        By the way, I'm a proponent of Unicode and I support and encourage its adoption. I'm a member of the Unicode Consortium. My name is proudly displayed on its Members page. ☺ (I confess I'm not an active member; I just pay to belong.) I've attended several Unicode Conferences and have had the good fortune to rub elbows with the Unicode cognoscenti. My keen interest in Unicode dovetails nicely with my love of Perl, whose Unicode support is excellent.

        Jim