Re^2: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding)

Thank you for your thorough explanation, tye. You answered my question.

The W3C is doing the right thing. (See 8.2.2.2 Character encodings in the HTML5 working draft specification.) Its willful violation of anachronistic standards for compelling, practical reasons is, IMHO, a practice that is overdue in Perl 5. By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1). Its failure to do this is one of the little things that make Perl 5 seem old and crufty, especially to Windows programmers. By dogmatically adhering to some misguided commitment to compatibility and portability, Perl 5 violates the principle of least astonishment.

By the way, I had done something like this…

C:\>chcp
Active code page: 1252

C:\>type match_test_3.pl
#!perl

use strict;
use warnings;
use open qw( :encoding(Windows-1252) :std );

my $pattern = qr/\A\w+\z/;

for my $word (@ARGV) {
    my $result = $word =~ $pattern ? "matches" : "doesn't match";
    printf qq/The word "%s" %s the pattern %s\n/, $word, $result, $pattern;
}

C:\>perl match_test_3.pl Tšekissä Žena Œdipus Rex
"\x{009a}" does not map to cp1252 at match_test_3.pl line 12.
The word "T\x{009a}ekissä" doesn't match the pattern (?^:\A\w+\z)
"\x{008e}" does not map to cp1252 at match_test_3.pl line 12.
The word "\x{008e}ena" doesn't match the pattern (?^:\A\w+\z)
"\x{008c}" does not map to cp1252 at match_test_3.pl line 12.
The word "\x{008c}dipus" doesn't match the pattern (?^:\A\w+\z)
The word "Rex" matches the pattern (?^:\A\w+\z)

C:\>

…before I posted my inquiry here, to prove to myself that the problem wasn't limited to Windows-1252 characters in the range from 80 thru 9F appearing literally within the Perl source file.
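For comparison, here is a minimal sketch of the same test with the arguments decoded explicitly before matching. It assumes the words reach the script as Windows-1252 bytes (as under code page 1252); `decode_args` is a hypothetical helper name, and the byte strings are hard-coded so the sketch does not depend on the console.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: treat each raw argument as Windows-1252 bytes
# and turn it into a string of Unicode characters.
sub decode_args {
    my @raw = @_;
    return map { decode('cp1252', $_) } @raw;
}

my $pattern = qr/\A\w+\z/;

# The bytes a cp1252 console would pass for Tšekissä, Žena, Œdipus, Rex.
my @raw   = ("T\x9Aekiss\xE4", "\x8Eena", "\x8Cdipus", "Rex");
my @words = decode_args(@raw);

# Assumption: the output is viewed as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
for my $word (@words) {
    my $result = $word =~ $pattern ? "matches" : "doesn't match";
    printf qq/The word "%s" %s the pattern\n/, $word, $result;
}
```

Once decoded, byte 9A becomes U+0161 (š) rather than the control character U+009A, so `\w` can match it.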

There's a Feedback button at the bottom of the page http://www.fileformat.info/info/unicode/char/009a/index.htm. ☺

Thanks again.

Jim


Replies are listed 'Best First'.
Re^3: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Tux (Canon) on Apr 19, 2012 at 06:16 UTC
"Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1)."

I really hope this will NEVER happen, not even on a Windows platform. cp1252 is only the "default" on Windows; it is not the default on any other platform, and changing perl5's default to cp1252 would break every script that assumes the current default (wise or not). Most perl scripts are cross-platform portable, or at least they can be when the programmer follows the basic porting rules. Most of my scripts and modules are cross platform, and I do test my modules on HP-UX, Linux, AIX and Windows (and sometimes even on OS X when I can access such a machine).

That said, the default is IMHO likely to change for Windows. If not in Windows 8 (or whatever they will call it), then maybe Windows 9 or 10 will have Unicode as the default character set. Problem solved. I already use UTF-8 as the default encoding in all my browsers (Opera, Firefox, Konqueror, Opera Mobile) and on IRC.

My advice to you would be to switch to using UTF-8 (and to declare `use utf8;` next to `use strict;` and `use warnings;` at the head of your scripts when you do).

Enjoy, Have FUN! H.Merijn
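A minimal sketch of what that advice could look like in practice, assuming the script file itself is saved as UTF-8 and the terminal displays UTF-8 (on Windows that would mean a UTF-8 code page, which is an assumption about the setup and not something tested in this thread):

```perl
use strict;
use warnings;
use utf8;                              # string literals in this file are UTF-8
use open qw( :std :encoding(UTF-8) );  # STDIN, STDOUT and STDERR are UTF-8

my $pattern = qr/\A\w+\z/;
my @words   = ('Tšekissä', 'Žena', 'Œdipus', 'Rex');

for my $word (@words) {
    my $result = $word =~ $pattern ? "matches" : "doesn't match";
    printf "%s (%d characters) %s\n", $word, length($word), $result;
}
```

With `use utf8;` the literals are character strings, so `length` counts characters rather than bytes and `\w` applies Unicode rules.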
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Jim (Curate) on Apr 19, 2012 at 17:44 UTC
Every one of your browsers (Opera, Firefox, Konqueror and Opera Mobile) will default to the Windows-1252 character encoding if wrongly told by the creator of the HTML document that the text of the document is in the Latin‑1 (ISO 8859‑1) character encoding. It was exactly this behavior of these popular web browsers that influenced the W3C to standardize the practice in its specification of HTML5, a willful violation of existing standards.

How many existing cross-platform Perl scripts treat characters in the range from 80 thru 9F as ISO 8859‑1 control codes? Maybe lots of them do. I don't know.

Jim
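The browser behavior described above can be sketched roughly as a label override applied before decoding. The table and the `decode_as_browser` name are hypothetical and cover only the one mapping discussed in this thread; this is not a reproduction of the HTML5 algorithm.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical override table: declared label => encoding actually used.
my %override = ( 'iso-8859-1' => 'cp1252' );

# Decode bytes the way the thread says browsers do: a document labelled
# Latin-1 is really decoded as Windows-1252.
sub decode_as_browser {
    my ($declared, $bytes) = @_;
    my $label     = lc $declared;
    my $effective = $override{$label} // $label;
    return decode($effective, $bytes);
}

# Bytes 93 and 94 are curly quotes in Windows-1252, control codes in Latin-1.
my $text = decode_as_browser('ISO-8859-1', "\x93quoted\x94");
printf "first: U+%04X, last: U+%04X\n", ord($text), ord(substr($text, -1));
```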
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Anonymous Monk on May 22, 2012 at 22:14 UTC
"Most of my scripts and modules are cross platform"

Unicode::Tussle doesn't appear to be runnable on win32 :)
Re^5: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Tux (Canon) on May 23, 2012 at 05:40 UTC
Well, that is not one of my modules or scripts, nor does it depend on any of my modules or scripts. What point do you want to make here?

Enjoy, Have FUN! H.Merijn
Re^6: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Anonymous Monk on May 23, 2012 at 06:02 UTC
Re^3: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by ikegami (Patriarch) on Apr 19, 2012 at 05:57 UTC
"By now, Perl 5 should also be defaulting to Windows-1252 instead of to ISO 8859-1 (Latin 1)"

I don't know of a single place where Perl assumes iso-8859-1. There are many places where Perl requires strings of Unicode code points. (In the above program, those would be the match operator and the encoder.) Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character. This makes it look like Perl defaults to iso-8859-1, but there is no "default" since there is only ever one thing those functions can accept.

Because there is no default, it also means the default cannot be changed, to cp1252 or anything else.
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by moritz (Cardinal) on Apr 19, 2012 at 07:19 UTC
"Since the strings passed to those were created by assigning each byte to a character, each byte is taken to be a Unicode code point. Not an iso-8859-1 character."

The act of interpreting a byte as a Unicode codepoint is exactly equivalent to decoding it as Latin-1. Which is why people say "Perl assumes ISO-8859-1", and that isn't wrong.

"Because there is no default, it also means the default cannot be changed, to cp1252 or anything else."

Such a change is possible, though not as easy as it sounds. It would require Perl to keep track of what is a byte and what is a codepoint, which would be a major departure from the current model (but inevitable in the long run, IMHO).

Perl 6 - the future is here, just unevenly distributed
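The equivalence claimed here can be checked with a short sketch that uses only the core Encode module; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One string holding every byte value from 00 to FF.
my $bytes = join '', map { chr($_) } 0x00 .. 0xFF;

# Decoding as Latin-1 yields the same sequence of code points, so the
# two strings compare equal character for character.
my $latin1 = decode('iso-8859-1', $bytes);
print "Latin-1 decoding leaves code points unchanged: ",
    ($latin1 eq $bytes ? "yes" : "no"), "\n";

# Decoding as Windows-1252 is where the two interpretations part ways.
my $cp1252 = decode('cp1252', "\x9A");
printf "byte 9A taken as-is: U+%04X, decoded as cp1252: U+%04X\n",
    ord("\x9A"), ord($cp1252);
```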
Re^5: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by ikegami (Patriarch) on Apr 19, 2012 at 16:56 UTC
Yes, it is equivalent, but that doesn't make iso-8859-1 a default. A default implies a choice, something that can be changed. This is a side-effect of a bug in the user's code, not a default.

"It would require Perl to keep track of what is a byte and what is a codepoint"

Even if you added a new type of data, I don't see how that helps. How can "É" match a byte?

(Upd: Well, I suppose you could add a pragma to specify the encoding to use when Perl needs text from bytes, but wouldn't that break `@-` and `pos`? So how would `/g` work? What about captures? They currently only capture from the supplied string, but that would have to be changed. Unless you're suggesting that the data in the scalar actually changes when the decoding happens? Yeah, I've been working on this.)

(And it should probably be "byte, decoded text or unknown", if only for backwards compatibility.)
Re^6: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by moritz (Cardinal) on Apr 19, 2012 at 19:35 UTC
Re^7: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by BrowserUk (Patriarch) on Apr 19, 2012 at 22:33 UTC
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Jim (Curate) on Apr 19, 2012 at 17:09 UTC
From perlunicode…

"use encoding" needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.
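The implicit upgrade described in that passage can be seen in a small sketch: joining an undecoded byte string to a string that contains a character above FF forces the upgrade, and the byte keeps its numeric value as a code point. The variable names are illustrative.

```perl
use strict;
use warnings;

my $bytes = "\x9A";       # the cp1252 byte for "š", never decoded
my $text  = "\x{263A}";   # a character above FF, so a Unicode string

# Concatenation upgrades $bytes implicitly, as if it were Latin-1.
my $mixed = $bytes . $text;

printf "first character: U+%04X\n", ord(substr($mixed, 0, 1));
printf "result is flagged as UTF-8 internally: %s\n",
    utf8::is_utf8($mixed) ? "yes" : "no";
```

The first character comes out as U+009A, a control code, not the U+0161 that a Windows-1252 reading would give.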
Re^5: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by ikegami (Patriarch) on Apr 24, 2012 at 02:16 UTC
Both the quoted passage and I said any tie to latin-1 is merely a side-effect. It's not something configurable.
Re^3: Windows-1252 characters from \x{0080} thru \x{009f} (source-code encoding) by Your Mother (Archbishop) on Apr 24, 2012 at 03:41 UTC
"Perl 5 should also be defaulting to Windows-1252 … little things that make Perl 5 seem old and crufty … Perl 5 violates the principle of least astonishment."

So, a highly limited Latin only encoding seems modern/uncrufty to you in 2012? There are many encodings and it’s pretty easy with newer perls to use whatever you like or to default to the entirely reasonable utf-8.

And not to put too fine a point on it (as the kids used to say, and with full knowledge that two of the very best hackers on PM are WinCats), but I’ve never ceased to be astonished that anyone used Windows ever.
Re^4: Windows-1252 characters from \x{0080} thru \x{009f} by Jim (Curate) on Apr 26, 2012 at 02:16 UTC
"So, a highly limited Latin only encoding seems modern/uncrufty to you in 2012?"

The Windows‑1252 character set isn't "highly limited." I'm an English-speaking monoglot (or, more to the point, an English-writing monoglot), so I can use the Windows‑1252 character set for all my writing. And I can likely continue to use it for a very long time, either until I learn another language that uses a writing system other than Latin, or until I drop dead. Saying Windows‑1252 is highly limited because it can't be used to write Chinese or Hebrew is like saying my Toyota Corolla is highly limited because it can't fly in the sky or sail the seas.

The Windows‑1252 and ISO 8859‑1 (Latin 1) character sets are still very commonly used today for digital text. For example, in my industry, e-discovery and litigation support in the United States, text and data are much more often Windows‑1252 than Unicode (UTF‑8). This is just how it is. So, no, Windows‑1252 and Latin 1 don't seem especially unmodern or crufty to me. They're just older, single-byte encodings, not Unicode, that's all.

By the way, I'm a proponent of Unicode and I support and encourage its adoption. I'm a member of the Unicode Consortium. My name is proudly displayed on its Members page. ☺ (I confess I'm not an active member; I just pay to belong.) I've attended several Unicode Conferences and have had the good fortune to rub elbows with the Unicode cognoscenti. My keen interest in Unicode dovetails nicely with my love of Perl, whose Unicode support is excellent.

Jim
Re^5: Windows-1252 characters from \x{0080} thru \x{009f} by grantm (Parson) on May 24, 2012 at 01:55 UTC
"I can use the Windows‑1252 character set for all my writing ... ☺"

I find it ironic that you claim to be able to get by with only the Windows‑1252 character set and then a few paragraphs later you use a character that's not in it. Sure, you can enter that character using the HTML numeric character entity form &#9786;, but then the same is true of any non-ASCII character.

So I don't really see why 1252 is so appealing to you. Given the more obvious choices of ASCII or Unicode, why choose an encoding that is neither one thing nor the other?