in reply to uparse - Parse Unicode strings

Brilliant work kcott!

Everything I've tested so far works like a charm on my Ubuntu Linux VM (running perl v5.38.0 built from source as described here).

A lot more convenient than the crude hack I was using, namely to click on the little xml link on a post to see the decimal values of the Unicode emojis. For example, clicking on the xml link on your post now allows me to see:

... difficult to tell them apart; e.g. <tt>&#129489;</tt> & <tt>&#128104;</tt>.

which I can then crudely translate back and forth between hex and decimal via one liners such as:

C:\> perl -e "printf q{%X}, 129489"
1F9D1
C:\> perl -e "printf q{%d}, 0x1F9D1"
129489
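The two one-liners above can be folded into a small script that converts in both directions and looks up the character name at the same time. This is only a sketch: `dec_to_uplus` and `hex_to_dec` are names made up for illustration, while `charnames::viacode` is the core-module call that maps a code point to its Unicode name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();    # loaded only for charnames::viacode

# Format a decimal code point in the usual U+XXXX notation.
sub dec_to_uplus {
    my ($dec) = @_;
    return sprintf 'U+%04X', $dec;
}

# Convert a hex string (optionally prefixed with "U+" or "0x") to decimal.
sub hex_to_dec {
    my ($hex) = @_;
    $hex =~ s/\A(?:U\+|0x)//i;
    return hex $hex;
}

# The two decimal values taken from the xml view of the post.
for my $dec (129489, 128104) {
    my $name = charnames::viacode($dec) // '<unknown>';
    printf "%d  %s  %s\n", $dec, dec_to_uplus($dec), $name;
}
```

The name lookup depends on the Unicode version bundled with your perl, so a very old perl may print `<unknown>` for newer emoji.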

That was working fine until Discipulus posted an emoji to me in the Chatterbox the other day ... and, oops, there was no xml link to click on! :)

👁️🍾👍🦟

Re^2: uparse - Parse Unicode strings
by kcott (Archbishop) on Nov 18, 2023 at 15:15 UTC

    I'm glad you liked it.

    It was actually prompted when looking at "Emojis for Perl Monk names" and being unable to determine what the emoji for tye was. Now that I know, it seems obvious:

    $ uparse 👔
    
    ============================================================
    String: '👔'
    ============================================================
    👔      U+1F454  NECKTIE
    ------------------------------------------------------------
    

    The emoji for gellyfish didn't even render for me; but I was still able to get information about it.

    $ uparse 🪼
    
    ============================================================
    String: '🪼'
    ============================================================
    🪼      U+1FABC  JELLYFISH
    ------------------------------------------------------------
    

    There are also things like the emoji for GrandFather, which I can only select as a single entity but which would benefit from some analysis.

    $ uparse 👨‍🦳‍👧‍👦
    
    ============================================================
    String: '👨‍🦳‍👧‍👦'
    ============================================================
    👨      U+1F468  MAN
            U+200D   ZERO WIDTH JOINER
    🦳      U+1F9B3  EMOJI COMPONENT WHITE HAIR
            U+200D   ZERO WIDTH JOINER
    👧      U+1F467  GIRL
            U+200D   ZERO WIDTH JOINER
    👦      U+1F466  BOY
    ------------------------------------------------------------
    

    Maybe at some future point we can add the white hair to this family setting:

    $ uparse 👨‍👧‍👦
    
    ============================================================
    String: '👨‍👧‍👦'
    ============================================================
    👨      U+1F468  MAN
            U+200D   ZERO WIDTH JOINER
    👧      U+1F467  GIRL
            U+200D   ZERO WIDTH JOINER
    👦      U+1F466  BOY
    ------------------------------------------------------------
    

    Although, maybe you can already do this with your Win11 Segoe UI Emoji font. Can you?

    — Ken

      Maybe at some future point we can add the white hair to this family setting ... maybe you can already do this with your Win11 Segoe UI Emoji font. Can you?

      You read me like a book, that's exactly what I was trying to do! :) ... and was bitterly disappointed when it didn't work.

      For completeness, I ran a simple standalone test using Windows 11 PowerShell.

      PS C:\> $joiner = [char]::ConvertFromUtf32(0x200D)
      PS C:\> $man = [char]::ConvertFromUtf32(0x1F468)
      PS C:\> $girl = [char]::ConvertFromUtf32(0x1F467)
      PS C:\> $boy = [char]::ConvertFromUtf32(0x1F466)
      PS C:\> $whitehair = [char]::ConvertFromUtf32(0x1F9B3)

      PS C:\> "$man$joiner$girl$joiner$boy"
      👨‍👧‍👦
      

      PS C:\> "$man$joiner$whitehair$joiner$girl$joiner$boy"
      👨‍🦳‍👧‍👦
      

      Running an equivalent test on Ubuntu bash with echo -e produced the same depressing result. It seems you can enjoy a family emoji with a default man, but not a man with white hair. Maybe a Unicode emoji expert knows how to do it, but I don't.
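For anyone who would rather repeat the PowerShell test above in Perl, here is a rough sketch. It builds the same two sequences with `chr` and lists their code points, in the spirit of uparse's output but not a copy of kcott's code; `zwj_sequence` and `describe` are hypothetical helper names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();    # loaded only for charnames::viacode

binmode STDOUT, ':encoding(UTF-8)';

my $ZWJ = chr 0x200D;    # ZERO WIDTH JOINER

# Join the given code points with ZERO WIDTH JOINER.
sub zwj_sequence {
    my @code_points = @_;
    return join $ZWJ, map { chr $_ } @code_points;
}

# Return one "U+XXXX  NAME" line per code point in the string.
sub describe {
    my ($string) = @_;
    my @lines;
    for my $char (split //, $string) {
        my $cp   = ord $char;
        my $name = charnames::viacode($cp) // '<unnamed>';
        push @lines, sprintf 'U+%04X  %s', $cp, $name;
    }
    return @lines;
}

my $family       = zwj_sequence(0x1F468, 0x1F467, 0x1F466);
my $white_family = zwj_sequence(0x1F468, 0x1F9B3, 0x1F467, 0x1F466);

for my $string ($family, $white_family) {
    print "$string\n";
    print "    $_\n" for describe($string);
}
```

How the two strings render still depends entirely on the terminal and font; the script only shows which code points are being sent.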

      👁️🍾👍🦟

        The problem is that the final glyphs are hard-coded. Although it might look like you're providing instructions to dynamically generate the glyphs, you're really only indicating which hard-coded glyphs to use.

        The following two glyphs can only be selected as a single entity; however MAN-GIRL-BOY is hard-coded but MAN-BOY-GIRL is not.

        👨‍👧‍👦

        $ uparse 👨‍👧‍👦
        
        ============================================================
        String: '👨‍👧‍👦'
        ============================================================
        👨      U+1F468  MAN
                U+200D   ZERO WIDTH JOINER
        👧      U+1F467  GIRL
                U+200D   ZERO WIDTH JOINER
        👦      U+1F466  BOY
        ------------------------------------------------------------
        

        👨‍👦‍👧

        $ uparse 👨‍👦‍👧
        
        ============================================================
        String: '👨‍👦‍👧'
        ============================================================
        👨      U+1F468  MAN
                U+200D   ZERO WIDTH JOINER
        👦      U+1F466  BOY
                U+200D   ZERO WIDTH JOINER
        👧      U+1F467  GIRL
        ------------------------------------------------------------
        

        In PowerShell:

        PS C:\Users\ken> $joiner = [char]::ConvertFromUtf32(0x200D)
        PS C:\Users\ken> $man = [char]::ConvertFromUtf32(0x1F468)
        PS C:\Users\ken> $girl = [char]::ConvertFromUtf32(0x1F467)
        PS C:\Users\ken> $boy = [char]::ConvertFromUtf32(0x1F466)
        PS C:\Users\ken> "$man$joiner$girl$joiner$boy"
        👨‍👧‍👦
        PS C:\Users\ken> "$man$joiner$boy$joiner$girl"
        👨‍👦‍👧
        

        Notice how the BOY remains on the lower-right regardless of whether the GIRL is included or not. The MAN-BOY is a preset glyph; adding GIRL after that just adds another glyph (even if the ZWJ combines these two glyphs into a singly-selectable unit).
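The "singly-selectable unit" part can be checked separately from the rendering. The sketch below assumes that selection follows extended grapheme clusters, which Perl exposes through `\X` in a regex; it compares the number of code points with the number of clusters for both orderings. `zwj_join`, `code_point_count` and `cluster_count` are names invented for this example, and the cluster counts you get depend on the Unicode version your perl supports.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

my $ZWJ = chr 0x200D;    # ZERO WIDTH JOINER

# Join the given code points with ZERO WIDTH JOINER.
sub zwj_join {
    return join $ZWJ, map { chr $_ } @_;
}

# Number of code points in a (decoded) string.
sub code_point_count {
    my ($string) = @_;
    return length $string;
}

# Number of extended grapheme clusters, counted with \X.
sub cluster_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my @sequences = (
    [ 'MAN-GIRL-BOY' => zwj_join(0x1F468, 0x1F467, 0x1F466) ],
    [ 'MAN-BOY-GIRL' => zwj_join(0x1F468, 0x1F466, 0x1F467) ],
);

for my $pair (@sequences) {
    my ($label, $string) = @$pair;
    printf "%-13s %s  code points: %d  clusters: %d\n",
        $label, $string, code_point_count($string), cluster_count($string);
}
```

If both lines report a single cluster, that would match the observation above: the joiners make each sequence one selectable unit, whether or not the font ships a combined glyph for it.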

        Hopefully, at some future point, sequences of glyphs, joiners, modifiers, and so on, will act as an instruction to dynamically generate a new glyph.

        — Ken