in reply to Re: incorrect length of strings with diphthongs
in thread incorrect length of strings with diphthongs

I may be mistaken, but aren't there (at least) two ways to encode an umlaut in Unicode? You could either use the dedicated character Ü or combine the letter U with the diacritic character ¨.

So the word "Hütte" could be 6 letters (Unicode code points) long, depending on the exact encoding and how length() is implemented? Not sure, I'm just going by Wikipedia: https://en.wikipedia.org/wiki/Combining_character


Re^3: incorrect length of strings with diphthongs
by choroba (Cardinal) on Aug 30, 2022 at 15:42 UTC
    That's true:
    #!/usr/bin/perl
    use warnings;
    use strict;

    use Unicode::Normalize qw{ normalize };

    my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
    binmode *STDOUT, ':encoding(UTF-8)';
    print normalize($_, $char), ' ' for qw( D C );

    Running the output through xxd:

    00000000: 75cc 8820 c3bc 20 u.. ..

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^3: incorrect length of strings with diphthongs
by LanX (Saint) on Aug 30, 2022 at 17:34 UTC
    Yes, I'd say it's similar to the skin-tone ("ethnic") modifiers of face emojis.

    But my expectation is that those modifiers don't count as characters and have length 0, i.e. "Hütte" should have length 5 in both incarnations.

    > how length() is implemented?

    I may be wrong tho...

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      #!/usr/bin/perl
      use strict;
      use feature qw{ say };
      use warnings;

      use Unicode::Normalize qw{ normalize };
      use Unicode::GCString;

      my $char = "\N{LATIN SMALL LETTER U WITH DIAERESIS}";
      binmode *STDOUT, ':encoding(UTF-8)';
      for (qw( D C )) {
          my $n = normalize($_, $char);
          my $gcs = 'Unicode::GCString'->new($n);
          say join ' ', length($n), $n =~ s/(\X)/$1/g, $1,
                        $gcs->chars, $gcs->columns, $gcs->length;
      }
      2 1 ü 2 1 1
      1 1 ü 1 1 1
      

      Update: Added the output.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Interesting, looks like code.

        I might even be able to install those modules and try to understand the output you didn't provide (yet)!

        ;-P

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

      > I may be wrong tho...

      I certainly am...

      #!/usr/bin/perl
      use v5.12;
      use strict;
      use utf8;
      use Devel::Peek;

      my $trema = "\N{COMBINING DIAERESIS}";
      binmode *STDOUT, ':encoding(UTF-8)';
      my $huette = "Hu${trema}tte";
      Dump $huette;
      say "$huette\'s length: ". length($huette);

      SV = PV(0x25f4a58) at 0x25266b8
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x28da368 "Hu\314\210tte"\0 [UTF8 "Hu\x{308}tte"]
        CUR = 7
        LEN = 10
      Hütte's length: 6
      That's what it looks like without code tags:

      Hütte's length: 6

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery
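For anyone who wants the count LanX originally expected (5 in both encodings), here is a minimal sketch that counts extended grapheme clusters with `\X` instead of code points. The helper name `grapheme_count` is made up for this example and is not from the thread; `NFC` and `NFD` come from Unicode::Normalize, which the posts above already use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw{ NFC NFD };

# Hypothetical helper (not from the thread): count extended grapheme
# clusters, i.e. what a reader perceives as one character.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;   # list assignment in scalar context
    return $count;
}

# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS.
my $composed   = NFC("H\N{U+00FC}tte");   # precomposed form
my $decomposed = NFD($composed);          # u followed by U+0308

binmode *STDOUT, ':encoding(UTF-8)';
for my $str ($composed, $decomposed) {
    printf "%s: length %d, graphemes %d\n",
        $str, length($str), grapheme_count($str);
}
```

length() reports 5 for the composed string and 6 for the decomposed one, while the `\X` count is 5 for both. Whether that is the "right" length depends on what you need the number for.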

Re^3: incorrect length of strings with diphthongs
by LanX (Saint) on Aug 30, 2022 at 21:15 UTC
    Whether u + ¨ is an umlaut is a matter of debate; it really depends on the definition of umlaut.

    Interestingly, it's possible to combine ü + ¨ to pile up tremas:

    Hü̈tte Hü̈̈tte

    I see huts with smoking chimneys ;-)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      There's also ű in Hungarian (called "double acute" in Unicode).

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Using two bars instead of dots is acceptable in German; there are fonts which render the umlauts ä, ö, ü this way (see also this).

        That comes from the odd shape of the small e in Kurrent script, which used to be written above the vowels.

        Apparently this is also connected to the history of the Czech letter Ů ů with an o superscript.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery