LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

A colleague kept complaining today that uc("ß") produced "SS" under Perl 5.24.

I tried explaining to him that this is according to standard orthographic rules taught in school.°

(Anyway he kept blaming Perl ... ;-)

NOW ... they actually invented and standardized a capital form some years ago, see https://en.wikipedia.org/wiki/Capital_%E1%BA%9E#Development_of_a_capital_form

Questions

I see a potential upgrading problem arising from there ...

My results so far with Strawberry Perl on a German Windows version:

    use strict;
    use warnings;
    use utf8;

    $\ = "\n";
    print "Perlversion $]";
    print "$_ -> ", ord($_) for "ß", "\Uß", uc("ß");

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/job/eszet.pl
    Perlversion 5.032001
    ß -> 223
    SS -> 83
    SS -> 83

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

°) "ß" is a ligature which developed centuries ago from two old "s" variants, long story ...

... for comparison, I occasionally see "oe" ligatures in French loan words in English texts

edit

erroneously posted in PMD, moved to SOPW

Re: uc and German eszett "ß"
by hippo (Archbishop) on Feb 01, 2022 at 20:35 UTC

    That is the very character used as an example in the doc for lc - does that help to clarify things in terms of how the locale, utf-8 flag, bytes pragma etc. affect it all?


    🦛

      > does that help to clarify things in terms of how the locale, utf-8 flag, bytes pragma etc. affect it all?

      hmm ... I'm still confused. It seems lc works well while uc wasn't updated yet. Which is counterintuitive.

      use strict;
      use warnings;
      use utf8;
      use open qw(:std :utf8);

      $\ = "\n";
      print "Perlversion $]";

      my $SS = "\x{1E9E}";

      no locale;
      print "=== local off LANG=$ENV{LANG}";
      print "* TEST UC";
      print "$_ -> ", ord($_) for "ß", "\Uß", uc("ß");
      print "* TEST LC";
      print "$_ -> ", ord($_) for $SS, "\L$SS", lc($SS);

      use locale;
      print "=== local on LANG=$ENV{LANG}";
      print "* TEST UC";
      print "$_ -> ", ord($_) for "ß", "\Uß", uc("ß");
      print "* TEST LC";
      print "$_ -> ", ord($_) for $SS, "\L$SS", lc($SS);

      Can't do lc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{1E9E}". at d:/tmp/job/eszet.pl line 33.
      Can't do lc("\x{1E9E}") on non-UTF-8 locale; resolved to "\x{1E9E}". at d:/tmp/job/eszet.pl line 33.
      Perlversion 5.032001
      === local off LANG=DEU
      * TEST UC
      ß -> 223
      SS -> 83
      SS -> 83
      * TEST LC
      ẞ -> 7838
      ß -> 223
      ß -> 223
      === local on LANG=DEU
      * TEST UC
      ß -> 223
      ß -> 223
      ß -> 223
      * TEST LC
      ẞ -> 7838
      ẞ -> 7838
      ẞ -> 7838

      NB: the warnings happen only when "use locale" is in effect, which deactivates all case conversion of these characters here.

      Furthermore, ẞ is a display problem of the monastery's code blocks; the character prints well inside my emacs.

      Cheers Rolf

      update

      I suppose Perl follows "unicode rules", but those haven't been updated yet to new "German rules" ...
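For anyone who needs the capital form today rather than "SS", here is a minimal sketch. The helper name uc_de is hypothetical and not part of Perl; it simply swaps ß for ẞ (U+1E9E) before calling uc, relying on the fact that uc leaves the already-uppercase U+1E9E unchanged.

```perl
use strict;
use warnings;
use utf8;                           # the string literals below are UTF-8
use feature 'unicode_strings';      # Unicode rules for uc regardless of internal encoding
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: uppercase using capital sharp s (U+1E9E) instead of "SS".
sub uc_de {
    my ($text) = @_;
    $text =~ tr/ß/ẞ/;               # ß (U+00DF) -> ẞ (U+1E9E)
    return uc $text;                # U+1E9E has no further uppercase mapping
}

print uc("Straße"),    "\n";        # STRASSE (default Unicode mapping)
print uc_de("Straße"), "\n";        # STRAẞE  (capital form)
```

Whether this is wanted depends on the consumer of the data; the default uc result is what the Unicode mapping prescribes.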

Re: uc and German eszett "ß"
by kcott (Archbishop) on Feb 02, 2022 at 10:09 UTC

    G'day Rolf,

    Here's all of the variations that I could think of:

    $ perl -v | head -2 | tail -1
    This is perl 5, version 34, subversion 0 (v5.34.0) built for cygwin-thread-multi
    $ echo $LANG
    en_AU.UTF-8
    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'
    $ perlu '
        say "$_ -> ", ord($_) for
            "ß", "\Uß", uc("ß"),
            "ẞ", "\Lẞ", lc("ẞ"),
            "\Fß", fc("ß"),
            "\Fẞ", fc("ẞ");
    '
    ß -> 223
    SS -> 83
    SS -> 83
    ẞ -> 7838
    ß -> 223
    ß -> 223
    ss -> 115
    ss -> 115
    ss -> 115
    ss -> 115
    

    From "Re^2: uc and German eszett "ß"":

    "Furthermore, ẞ is a display problem of the monastery's code blocks; the character prints well inside my emacs."

    When using non-ASCII characters, I replace "code" with "pre" and "c" with "tt". I think the problem is more to do with PM's encoding than a specific code block issue; for example, you'll get the same rendering of entities, rather than characters, in paragraph text. Someone more knowledgeable may have a better (more complete) answer to that.

    Update: I removed four instances of ken@titan ~/tmp that preceded each of the commands above. I had originally just done a copy-paste from my screen, but that information is irrelevant clutter.

    — Ken

      The HTML specs are not very specific about how "code" vs "pre" really works. It's mostly on the order of "Dear Browser! FYI, this part is some sort of program code thing. Please do something about it if you want (but you are not required to)." Basically, the code tag says "here is some text using the "monospace" font family".

      The "pre" tag is just as vague: it preserves line breaks and other whitespace characters and uses a fixed-width font. And again, that is pretty much all that the standard says about it, as far as I can tell.

      perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'
        HTML spec is irrelevant here, PerlMonks interprets <code> in its own way.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        As ++choroba correctly points out, PM <code> is not HTML <code>. See "Markup in the Monastery".

        The PM <code> tag provides some conveniences. It automatically handles certain special characters; for instance, you can paste code with $x < $y without having to manually change that to $x &lt; $y. It also adds the "download" link for blocks of code.

        The <code> and <c> are interchangeable. I usually use the former for blocks and the latter for inline: that's just a personal preference.

        With <pre> and <tt>, you will need to manually edit special characters; accordingly, I try to keep these as small as possible. You also don't get the "download" link.

        — Ken

Re: uc and German eszett "ß" (Unicode standard)
by LanX (Saint) on Feb 02, 2022 at 10:47 UTC
    Mystery solved: After reading up the docs for fc I realized that Perl implements this Unicode standard for cases:

    The docs also claim that fc is the only acceptable solution for comparing "case insensitive strings".

    Unfortunately, we are working with an IBM product°, which sees a difference between fields named "Straße" and "Strasse", while fc (and German readers) will consider them identical.
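To illustrate the comparison described above, here is a small sketch using fc (available since Perl 5.16 via the fc feature). The helper name same_name is hypothetical; it only wraps the usual fc($a) eq fc($b) idiom.

```perl
use strict;
use warnings;
use utf8;                               # the string literals below are UTF-8
use feature qw(fc unicode_strings);     # fc needs Perl 5.16 or later

# Hypothetical helper: case-insensitive equality via full Unicode case folding.
sub same_name {
    my ($left, $right) = @_;
    return fc($left) eq fc($right) ? 1 : 0;
}

print same_name("Straße", "Strasse") ? "same\n" : "different\n";   # same
print same_name("Straße", "STRASSE") ? "same\n" : "different\n";   # same
print same_name("Straße", "Strase")  ? "same\n" : "different\n";   # different
```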

    Now regarding uc of "ß", that's to be decided between the Unicode committee and German language authorities.

    (this is not so much a problem for German readers but for CS and "equality of strings")

    Cheers Rolf

    update

    °) which is surprisingly fucked up in this respect https://www.ibm.com/docs/en/cognos-tm1/10.2.2?topic=pitf-unicodeupperlowercase-1

      Unicode is hard. IBM is not to blame, we're having problems with ICU and Postgres, too.
        > Unicode is hard.

        I agree, but what they did is even "harder".

        Identifiers containing non-ASCII (Unicode) characters are case-sensitive; those without are case-insensitive.

        update

          TM1 treats ASCII object names as case-insensitive; the element name SALES is equivalent to sales. A reference to either SALES, sales, or even SaLeS is considered to be a reference to a single element. Similarly, the cube name Projections is equivalent to PROJECTIONS.

          However, Unicode object names are not treated as case-insensitive. Consequently, a server can contain two identically named objects that varied only in case. For example, the elements NEMÈIJA and nemèija can exist in a single dimension, and each is considered a unique element

        Cheers Rolf

      //i is equally acceptable.

      That's why qr/(?<=ss)x/ui fails in versions of Perl without experimental (and buggy) variable look-behinds. /u is added by use feature qw( unicode_strings ); since 5.14, and thus by use v5.14; and higher.

      Also,

      $ perl -Mre=debug -e'qr/[abc\xDF]/ui'
      Compiling REx "[abc\xDF]"
      ~ tying lastbr BRANCH (4) to ender TAIL (15) offset 11
      ~ tying lastbr BRANCH (1) to ender END (16) offset 15
      Final program:
      1: BRANCH (4)
      2: EXACTFUP <ss> (16)
      4: BRANCH (FAIL)
      5: ANYOF[ABCabc] (16)
      15: TAIL (16)
      16: END (0)
      minlen 1
      Freeing REx: "[abc\xDF]"
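The same multi-character fold can be seen from the matching side. This is only a sketch: the helper name matches_strasse is made up for illustration, and the pattern deliberately avoids quantifiers or groups around ß, since Perl's documentation notes that multi-character folds are not handled when they are split that way.

```perl
use strict;
use warnings;
use utf8;                       # the ß in the pattern below is UTF-8 source text

# Hypothetical helper: whole-string, case-insensitive match under Unicode rules (/u).
sub matches_strasse {
    my ($candidate) = @_;
    return $candidate =~ /\Astraße\z/ui ? 1 : 0;
}

print matches_strasse("STRASSE"), "\n";   # 1: "SS" matches ß under /ui
print matches_strasse("Straße"),  "\n";   # 1
print matches_strasse("Strase"),  "\n";   # 0
```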
        I think fc and //i should have the same inner semantics.

        But //i doesn't help in our case since we needed hash-keys.

        Anyway this was already solved on another level ...
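For the hash-key case mentioned above, one possible approach (a sketch, not what was actually done here) is to fold every key with fc before storing or looking it up. The names %field, set_field and get_field are hypothetical.

```perl
use strict;
use warnings;
use utf8;                               # the string literals below are UTF-8
use feature qw(fc unicode_strings);     # fc needs Perl 5.16 or later
binmode STDOUT, ':encoding(UTF-8)';

my %field;                              # keys are always stored case-folded

# Hypothetical helpers: normalize the key with fc on every access.
sub set_field {
    my ($name, $value) = @_;
    $field{ fc $name } = $value;
    return;
}

sub get_field {
    my ($name) = @_;
    return $field{ fc $name };
}

set_field("Straße", "Hauptstraße 1");
print get_field("STRASSE"), "\n";       # Hauptstraße 1
```

The original spelling of the key is lost this way; if it matters, it has to be kept alongside the value.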

        Cheers Rolf

Re: uc and German eszett "ß"
by ikegami (Patriarch) on Feb 04, 2022 at 15:07 UTC
      --> Yep!

      If and how this will change is a matter of politics ...

      (but I'm pretty ignorant about how Unicode adapts to changes in orthography)

      Cheers Rolf