Re: uc and German eszett "ß"
by hippo (Archbishop) on Feb 01, 2022 at 20:35 UTC
|
That is the very character used as an example in the doc for lc - does that help to clarify things in terms of how the locale, utf-8 flag, bytes pragma etc. affect it all?
| [reply] |
|
|
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
$\="\n";
print "Perlversion $]";
my $SS = "\x{1E9E}";
no locale;
print "=== local off LANG=$ENV{LANG}";
print "* TEST UC";
print "$_ -> ",ord($_) for "ß", "\Uß", uc("ß");
print "* TEST LC";
print "$_ -> ",ord($_) for $SS, "\L$SS", lc($SS);
use locale;
print "=== local on LANG=$ENV{LANG}";
print "* TEST UC";
print "$_ -> ",ord($_) for "ß", "\Uß", uc("ß");
print "* TEST LC";
print "$_ -> ",ord($_) for $SS, "\L$SS", lc($SS);
Can't do lc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{1E9E}". at d:/tmp/job/eszet.pl line 33.
Can't do lc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{1E9E}". at d:/tmp/job/eszet.pl line 33.
Perlversion 5.032001
=== local off LANG=DEU
* TEST UC
ß -> 223
SS -> 83
SS -> 83
* TEST LC
ẞ -> 7838
ß -> 223
ß -> 223
=== local on LANG=DEU
* TEST UC
ß -> 223
ß -> 223
ß -> 223
* TEST LC
ẞ -> 7838
ẞ -> 7838
ẞ -> 7838
NB: the warnings appear only when "use locale" is in effect, which also disables all case conversion here.
Furthermore, ẞ is a display problem of the monastery's code blocks; the character prints well inside my emacs.
update
I suppose Perl follows "unicode rules", but those haven't been updated yet to new "German rules" ...
Re: uc and German eszett "ß"
by kcott (Archbishop) on Feb 02, 2022 at 10:09 UTC
|
G'day Rolf,
Here's all of the variations that I could think of:
$ perl -v | head -2 | tail -1
This is perl 5, version 34, subversion 0 (v5.34.0) built for cygwin-thread-multi
$ echo $LANG
en_AU.UTF-8
$ alias perlu
alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'
$ perlu '
say "$_ -> ", ord($_) for
"ß", "\Uß", uc("ß"),
"ẞ", "\Lẞ", lc("ẞ"),
"\Fß", fc("ß"),
"\Fẞ", fc("ẞ");
'
ß -> 223
SS -> 83
SS -> 83
ẞ -> 7838
ß -> 223
ß -> 223
ss -> 115
ss -> 115
ss -> 115
ss -> 115
From "Re^2: uc and German eszett "ß"":
"Furthermore, ẞ is a display problem of the monastery's code blocks; the character prints well inside my emacs."
When using non-ASCII characters, I replace "code" with "pre" and "c" with "tt".
I think the problem is more to do with PM's encoding than a specific code block issue;
for example, you'll get the same rendering of entities, rather than characters, in paragraph text.
Someone more knowledgeable may have a better (more complete) answer to that.
Update:
I removed four instances of ken@titan ~/tmp that preceded each of the commands above.
I had originally just done a copy-paste from my screen, but that information is irrelevant clutter.
| [reply] [d/l] [select] |
|
|
The HTML specs are not very specific about how "code" vs "pre" really works. It's mostly on the order of "Dear Browser! FYI, this part is some sort of program code thing. Please do something about it if you want (but you are not required to)." Basically, the code tag says "here is some text using the "monospace" font family".
The "pre" tag is just as vague: it preserves line breaks and other whitespace characters and uses a fixed-width font. And again, that is pretty much all that the standard says about it, as far as I can tell.
perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'
| [reply] [d/l] |
|
|
HTML spec is irrelevant here, PerlMonks interprets <code> in its own way.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
|
As ++choroba correctly points out, PM <code> is not HTML <code>.
See "Markup in the Monastery".
The PM <code> tag provides some conveniences.
It automatically handles certain special characters; for instance,
you can paste code with $x < $y without having to manually change that to $x &lt; $y.
It also adds the "download" link for blocks of code.
The <code> and <c> tags are interchangeable.
I usually use the former for blocks and the latter for inline: that's just a personal preference.
With <pre> and <tt>, you will need to manually edit special characters;
accordingly, I try to keep these as small as possible.
You also don't get the "download" link.
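To make the manual editing for <pre> and <tt> concrete, here is a minimal sketch of the escaping step. The helper name escape_for_pre is made up for illustration, and it only covers the three HTML special characters; it is not PerlMonks' own code.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML
# so a snippet can be pasted inside <pre> or <tt>.
sub escape_for_pre {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # ampersand first, so the entities added below are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_for_pre('print "less" if $x < $y && $y > 0;'), "\n";
```

Inside <code> or <c> none of this is needed, which is the convenience described above.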
| [reply] [d/l] [select] |
Re: uc and German eszett "ß" (Unicode standard)
by LanX (Saint) on Feb 02, 2022 at 10:47 UTC
|
Mystery solved: After reading the docs for fc, I realized that Perl implements the Unicode standard's case-folding rules.
The docs also claim that fc is the only acceptable solution for comparing "case insensitive strings".
Unfortunately, we are working with an IBM product°, which sees a difference between fields named "Straße" and "Strasse", while fc (and German readers) will consider them identical.
Now regarding uc of "ß", that's to be decided between the Unicode committee and German language authorities.
(this is not so much a problem for German readers but for CS and "equality of strings")
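A minimal sketch of the fc comparison, assuming Perl 5.16 or later; the helper name same_ignoring_case is made up for illustration:

```perl
use strict;
use warnings;
use utf8;                  # the string literals below are UTF-8 in the source
use feature qw(fc);        # fc() is available from Perl 5.16

# Hypothetical helper: true if two strings are equal after full case folding.
sub same_ignoring_case {
    my ($left, $right) = @_;
    return fc($left) eq fc($right);
}

binmode STDOUT, ':encoding(UTF-8)';
for my $pair (["Straße", "Strasse"], ["Straße", "STRASSE"], ["Straße", "Strase"]) {
    my ($left, $right) = @$pair;
    printf "%s vs %s: %s\n", $left, $right,
        same_ignoring_case($left, $right) ? "same" : "different";
}
```

Both sides are folded before comparing, so it does not matter which spelling is stored and which one is typed.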
update
°) which is surprisingly fucked up in this respect https://www.ibm.com/docs/en/cognos-tm1/10.2.2?topic=pitf-unicodeupperlowercase-1
|
|
Unicode is hard. IBM is not to blame, we're having problems with ICU and Postgres, too.
| [reply] [d/l] |
|
|
> Unicode is hard.
I agree, but what they did is even "harder".
Identifiers containing non-ASCII (Unicode) characters are case-sensitive; identifiers without them are case-insensitive.
update
TM1 treats ASCII object names as case-insensitive; the element name SALES is equivalent to sales. A reference to either SALES, sales, or even SaLeS is considered to be a reference to a single element. Similarly, the cube name Projections is equivalent to PROJECTIONS.
However, Unicode object names are not treated as case-insensitive. Consequently, a server can contain two identically named objects that varied only in case. For example, the elements NEMÈIJA and nemèija can exist in a single dimension, and each is considered a unique element
| [reply] |
|
|
//i is equally acceptable.
That's why qr/(?<=ss)x/ui fails in versions of Perl without experimental (and buggy) variable look-behinds. /u is added by use feature qw( unicode_strings ); since 5.14, and thus by use v5.14; and higher.
Also,
$ perl -Mre=debug -e'qr/[abc\xDF]/ui'
Compiling REx "[abc\xDF]"
~ tying lastbr BRANCH (4) to ender TAIL (15) offset 11
~ tying lastbr BRANCH (1) to ender END (16) offset 15
Final program:
1: BRANCH (4)
2: EXACTFUP <ss> (16)
4: BRANCH (FAIL)
5: ANYOF[ABCabc] (16)
15: TAIL (16)
16: END (0)
minlen 1
Freeing REx: "[abc\xDF]"
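The //i point can be checked with a small sketch. The helper name matches_strasse is made up for illustration; "use utf8" makes the pattern a Unicode string, so Unicode rules apply to the match:

```perl
use strict;
use warnings;
use utf8;    # the ß in the pattern below is a UTF-8 literal

# Hypothetical helper: case-insensitive whole-string match against "Straße".
sub matches_strasse {
    my ($candidate) = @_;
    return $candidate =~ /\AStraße\z/i ? 1 : 0;
}

print "$_: ", matches_strasse($_), "\n" for qw(STRASSE strasse Strase);
```

The multi-character fold means ß in the pattern can consume two characters of the target, which is the same effect that makes the look-behind above variable-length.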
| [reply] [d/l] [select] |
Re: uc and German eszett "ß"
by ikegami (Patriarch) on Feb 04, 2022 at 15:07 UTC
|
Perl's uc complies with Unicode's uppercasing rules ("when Unicode rules are in effect").
Case translation operators use the Unicode case translation tables.
Ref:
latest perlunicode
latest perlunicode at time of writing
So uc("ß") will change if and when Unicode changes.
The latest version of Unicode (14.0.0) indicates the uppercase of "ß" (LATIN SMALL LETTER SHARP S) is "SS". (The fourth field.)
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
Ref:
Latest case special case rules
Latest case special case rules at time of writing
Seeing as the ẞ (LATIN CAPITAL LETTER SHARP S) has been part of Unicode since version 5.1 in 2008, this isn't likely to change.
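To see how the quoted record lines up with uc, here is a rough sketch that splits the line into its first four fields (code point, lower, title, upper) and compares them with Perl's own results. The helper name parse_special_casing is made up for illustration; this is not how Perl itself reads its tables.

```perl
use v5.14;       # needed for s///r; also enables strict and the unicode_strings feature
use warnings;

# Hypothetical helper: split one SpecialCasing-style record into its first four fields.
sub parse_special_casing {
    my ($record) = @_;
    $record =~ s/\s*#.*\z//s;                       # drop the trailing comment
    my @fields = map { s/\A\s+|\s+\z//gr } split /;/, $record;
    my $decode = sub {                              # "0053 0053" -> "SS"
        my ($hex_list) = @_;
        return join '', map { chr hex $_ } split ' ', $hex_list;
    };
    return {
        code  => hex $fields[0],
        lower => $decode->($fields[1]),
        title => $decode->($fields[2]),
        upper => $decode->($fields[3]),
    };
}

my $rule = parse_special_casing(
    '00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S'
);
my $char = chr $rule->{code};
say "upper field: $rule->{upper}";
say "uc agrees with upper field:      ", (uc($char)      eq $rule->{upper} ? "yes" : "no");
say "ucfirst agrees with title field: ", (ucfirst($char) eq $rule->{title} ? "yes" : "no");
```

The unicode_strings feature matters here: without it, a one-byte "\xDF" string is not uppercased under Unicode rules.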
| [reply] [d/l] [select] |