uc and German eszett "ß"

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Re: uc and German eszett "ß" by hippo (Archbishop) on Feb 01, 2022 at 20:35 UTC
That is the very character used as an example in the doc for lc - does that help to clarify things in terms of how the locale, utf-8 flag, bytes pragma etc. affect it all? 🦛
Re^2: uc and German eszett "ß" by LanX (Saint) on Feb 01, 2022 at 21:19 UTC
> does that help to clarify things in terms of how the locale, utf-8 flag, bytes pragma etc. affect it all?
Hmm ... I'm still confused. It seems `lc` works well while `uc` hasn't been updated yet, which is counterintuitive.
`use strict; use warnings; use utf8; use open qw(:std :utf8); $\="\n"; print "Perlversion $]"; my $SS = "\x{1E9E}"; no locale; print "=== local off LANG=$ENV{LANG}"; print "* TEST UC"; print "$_ -> ",ord($_) for "ß", "\Uß", uc("ß"); print "* TEST LC"; print "$_ -> ",ord($_) for $SS, "\L$SS", lc($SS); use locale; print "=== local on LANG=$ENV{LANG}"; print "* TEST UC"; print "$_ -> ",ord($_) for "ß", "\Uß", uc("ß"); print "* TEST LC"; print "$_ -> ",ord($_) for $SS, "\L$SS", lc($SS);`
Output:
`Can't do lc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{1E9E}". at d:/tmp/job/eszet.pl line 33. Can't do lc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{1E9E}". at d:/tmp/job/eszet.pl line 33. Perlversion 5.032001 === local off LANG=DEU * TEST UC ß -> 223 SS -> 83 SS -> 83 * TEST LC ẞ -> 7838 ß -> 223 ß -> 223 === local on LANG=DEU * TEST UC ß -> 223 ß -> 223 ß -> 223 * TEST LC ẞ -> 7838 ẞ -> 7838 ẞ -> 7838`
NB: the warnings appear only when `use locale` is in effect, which disables all case conversion here. Furthermore, `ẞ` is a display problem of the monastery's code blocks; the character prints well inside my emacs.
Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery)
Update: I suppose Perl follows "Unicode rules", but those haven't been updated yet to the new "German rules" ...
Re: uc and German eszett "ß" by kcott (Archbishop) on Feb 02, 2022 at 10:09 UTC
G'day Rolf,
Here are all of the variations that I could think of:
`$ perl -v | head -2 | tail -1 This is perl 5, version 34, subversion 0 (v5.34.0) built for cygwin-thread-multi $ echo $LANG en_AU.UTF-8 $ alias perlu alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'`
$ perlu ' say "$_ -> ", ord($_) for "ß", "\Uß", uc("ß"), "ẞ", "\Lẞ", lc("ẞ"), "\Fß", fc("ß"), "\Fẞ", fc("ẞ"); ' ß -> 223 SS -> 83 SS -> 83 ẞ -> 7838 ß -> 223 ß -> 223 ss -> 115 ss -> 115 ss -> 115 ss -> 115
From "Re^2: uc and German eszett "ß"": "Furthermore, `ẞ` is a display problem of the monastery's code blocks; the character prints well inside my emacs."
When using non-ASCII characters, I replace "code" with "pre" and "c" with "tt". I think the problem has more to do with PM's encoding than with a specific code block issue; for example, you'll get the same rendering of entities, rather than characters, in paragraph text. Someone more knowledgeable may have a better (more complete) answer to that.
Update: I removed four instances of `ken@titan ~/tmp` that preceded each of the commands above. I had originally just done a copy-paste from my screen, but that information is irrelevant clutter.
-- Ken
Re^2: uc and German eszett "ß" by cavac (Prior) on Feb 02, 2022 at 14:01 UTC
The HTML specs are not very specific about how "code" vs "pre" really works. It's mostly on the order of "Dear Browser! FYI, this part is some sort of program code thing. Please do something about it if you want (but you are not required to)." Basically, the "code" tag says "here is some text using the "monospace" font family". The "pre" tag is just as vague: it preserves line breaks and other whitespace characters and uses a fixed-width font. And again, that is pretty much all that the standard says about that, as far as I can tell.
`perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'`
Re^3: uc and German eszett "ß" by choroba (Cardinal) on Feb 02, 2022 at 14:25 UTC
The HTML spec is irrelevant here; PerlMonks interprets `<code>` in its own way. `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re^3: uc and German eszett "ß" by kcott (Archbishop) on Feb 02, 2022 at 22:01 UTC
As ++choroba correctly points out, PM `<code>` is not HTML `<code>`. See "Markup in the Monastery".
The PM `<code>` tag provides some conveniences. It automatically handles certain special characters; for instance, you can paste code with `$x < $y` without having to manually change that to `$x &lt; $y`. It also adds the "download" link for blocks of code. The `<code>` and `<c>` tags are interchangeable. I usually use the former for blocks and the latter for inline: that's just a personal preference.
With `<pre>` and `<tt>`, you will need to manually edit special characters; accordingly, I try to keep these as small as possible. You also don't get the "download" link.
-- Ken
Re: uc and German eszett "ß" (Unicode standard) by LanX (Saint) on Feb 02, 2022 at 10:47 UTC
Mystery solved: After reading up on the docs for `fc`, I realized that Perl implements this Unicode standard for case mappings: https://www.unicode.org/charts/case/
The docs also claim that `fc` is the only acceptable solution for comparing strings case-insensitively. Unfortunately, we are working with an IBM product°, which sees a difference between fields named "Straße" and "Strasse", while `fc` (and German readers) will consider them identical.
Now regarding `uc` of "ß", that's to be decided between the Unicode committee and the German language authorities. (This is not so much a problem for German readers as for CS and "equality of strings".)
Update: °) which is surprisingly fucked up in this respect: https://www.ibm.com/docs/en/cognos-tm1/10.2.2?topic=pitf-unicodeupperlowercase-1
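A minimal sketch of the `fc` comparison described above, assuming Perl 5.16 or later (where `fc` is available). The helper name `same_name` is hypothetical and only serves the illustration:

```perl
use v5.16;      # enables strict, say, fc and unicode_strings
use warnings;
use utf8;       # the literals below are written in UTF-8

# Hypothetical helper: compare two strings via full Unicode case folding.
sub same_name {
    my ($left, $right) = @_;
    return fc($left) eq fc($right) ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $pair (["Straße", "Strasse"], ["Straße", "STRASSE"], ["Straße", "Strase"]) {
    my ($left, $right) = @$pair;
    say "$left vs $right: ", same_name($left, $right) ? "same" : "different";
}
```

Folding both sides with `fc` rather than `lc` or `uc` is what makes "ß" and "ss" compare equal, since only the fold maps "ß" to "ss" and leaves "ss" alone.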
Re^2: uc and German eszett "ß" (Unicode standard) by choroba (Cardinal) on Feb 02, 2022 at 12:10 UTC
Unicode is hard. IBM is not to blame; we're having problems with ICU and Postgres, too.
Re^3: uc and German eszett "ß" (Unicode standard) by LanX (Saint) on Feb 02, 2022 at 13:04 UTC
> Unicode is hard.
I agree, but what they did is even "harder": identifiers containing non-ASCII characters are case-sensitive, while pure-ASCII identifiers are case-insensitive.
Update: "TM1 treats ASCII object names as case-insensitive; the element name SALES is equivalent to sales. A reference to either SALES, sales, or even SaLeS is considered to be a reference to a single element. Similarly, the cube name Projections is equivalent to PROJECTIONS. However, Unicode object names are not treated as case-insensitive. Consequently, a server can contain two identically named objects that varied only in case. For example, the elements NEMČIJA and nemčija can exist in a single dimension, and each is considered a unique element."
Re^2: uc and German eszett "ß" (Unicode standard) by ikegami (Patriarch) on Feb 04, 2022 at 15:22 UTC
`//i` is equally acceptable. That's why `qr/(?<=ss)x/ui` fails in versions of Perl without experimental (and buggy) variable-length look-behinds: under `/ui`, `ss` can also match the single character `ß`. `/u` is added by `use feature qw( unicode_strings );` since 5.14, and thus by `use v5.14;` and higher.
Also, `$ perl -Mre=debug -e'qr/[abc\xDF]/ui' Compiling REx "[abc\xDF]" ~ tying lastbr BRANCH (4) to ender TAIL (15) offset 11 ~ tying lastbr BRANCH (1) to ender END (16) offset 15 Final program: 1: BRANCH (4) 2: EXACTFUP <ss> (16) 4: BRANCH (FAIL) 5: ANYOF[ABCabc] (16) 15: TAIL (16) 16: END (0) minlen 1 Freeing REx: "[abc\xDF]"`
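To illustrate the `//i` point, here is a small sketch, assuming Perl 5.16 or later so that `unicode_strings` (and with it `/u` semantics) is switched on by the version bundle. The helper name `is_street` is made up for this example:

```perl
use v5.16;      # unicode_strings is part of this bundle, so patterns get /u rules
use warnings;
use utf8;       # the pattern below contains a literal ß

# Hypothetical helper: case-insensitive whole-string match against "straße".
sub is_street {
    my ($text) = @_;
    return $text =~ /\Astraße\z/i ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $candidate ("Straße", "Strasse", "STRASSE", "Strase") {
    say "$candidate: ", is_street($candidate) ? "match" : "no match";
}
```

The multi-character fold is the same mechanism that shows up as `EXACTFUP <ss>` in the `re=debug` output above.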
Re^3: uc and German eszett "ß" (Unicode standard) by LanX (Saint) on Feb 04, 2022 at 17:08 UTC
I think `fc` and `//i` should have the same inner semantics. But `//i` doesn't help in our case, since we needed hash keys. Anyway, this was already solved on another level ...
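For the hash-key case, one possible approach (a sketch only, not necessarily what was done in the project mentioned above) is to fold every key with `fc` before storing or looking it up. The helper name `index_by_folded_name` is hypothetical:

```perl
use v5.16;      # enables strict, say, fc and unicode_strings
use warnings;
use utf8;       # the literals below are written in UTF-8

# Hypothetical helper: group names under their case-folded form.
sub index_by_folded_name {
    my (@names) = @_;
    my %index;
    push @{ $index{ fc($_) } }, $_ for @names;
    return \%index;
}

binmode STDOUT, ':encoding(UTF-8)';
my $street_index = index_by_folded_name("Straße", "Strasse", "STRASSE", "Platz");
for my $key (sort keys %$street_index) {
    say "$key: ", join(", ", @{ $street_index->{$key} });
}
```

Keeping the original spellings as values makes it possible to report collisions, which is exactly the situation where a system that treats "Straße" and "Strasse" as distinct would disagree with `fc`.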
Re^4: uc and German eszett "ß" (Unicode standard) by ikegami (Patriarch) on Feb 05, 2022 at 22:04 UTC
Re^5: uc and German eszett "ß" (Unicode standard) by LanX (Saint) on Feb 05, 2022 at 22:13 UTC
Re: uc and German eszett "ß" by ikegami (Patriarch) on Feb 04, 2022 at 15:07 UTC
Perl's `uc` complies with Unicode's uppercasing rules ("when Unicode rules are in effect"): "Case translation operators use the Unicode case translation tables." Ref: perlunicode (latest version, and the version at time of writing).
So `uc("ß")` will change if and when Unicode changes.
The latest version of Unicode (14.0.0) indicates the uppercase of "ß" (LATIN SMALL LETTER SHARP S) is "SS". (The fourth field.) `00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S` Ref: special case rules (latest version, and the version at time of writing).
Seeing as the ẞ (LATIN CAPITAL LETTER SHARP S) has been part of Unicode since version 5.1 in 2008, this isn't likely to change.
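The three mapping fields of the special-casing line quoted above (lower, title, upper) can be observed from Perl with `lc`, `ucfirst` and `uc`. This is a sketch assuming Perl 5.16 or later with Unicode rules in effect; `case_forms` and `code_points` are hypothetical helper names:

```perl
use v5.16;      # unicode_strings is part of this bundle
use warnings;

# Hypothetical helper: lower, title and upper case forms of a string.
sub case_forms {
    my ($text) = @_;
    return (lc($text), ucfirst($text), uc($text));
}

# Hypothetical helper: render a string as space-separated hex code points.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf '%04X', ord } split //, $text;
}

my ($lower, $title, $upper) = case_forms("\x{DF}");    # LATIN SMALL LETTER SHARP S
say join '; ', '00DF', map { code_points($_) } $lower, $title, $upper;
```

Printing code points instead of the characters themselves sidesteps any terminal or locale encoding question.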
Re^2: uc and German eszett "ß" by LanX (Saint) on Feb 04, 2022 at 15:13 UTC
--> Yep! If and how this will change is a matter of politics ... (but I'm pretty ignorant about how Unicode adapts to changes in orthography)
