PerlMonks  

Re: uparse - Parse Unicode strings

by Tux (Canon)
on Nov 18, 2023 at 10:05 UTC


in reply to uparse - Parse Unicode strings

The penguin is part of my prompt

Download uchar

tux 🐧 uchar --help
usage: uchar -v [-m base:count [ -m base:count ] ...
       uchar -v -f char ...
  perl 5.38.0 with Unicode 15.0.0

        -m      show maps
        -v      verbosity
        -l      list GBA characters
        -f      find
        -F      find (only chars supported in current font)
         -s     splash all characters found into a single string
        -k      show matching key combo(s)
        -d      apply random diacricals
        -e      show character encodings (uchar -e -f u_BREVE)
         -o     also show octal version of encoding
        -E      show character decodings (uchar -E fc)
        -b      strip to base
        -D      show codepoints in decimal
        -c      copy found string(s) to clipboard
        -h      also show html entity if available
tux 🐧 uchar -v X🩼X
X U00058 \N{LATIN CAPITAL LETTER X}
🩼 U1fa7c \N{CRUTCH}
X U00058 \N{LATIN CAPITAL LETTER X}
tux 🐧 uchar -v U+1f427
🐧 U1f427 \N{PENGUIN}
tux 🐧 uchar -e U+1f427
🐧 U1f427 \N{PENGUIN}

  cp1026                         6f
  cp1047                         6f
  cp37                           6f
  cp424                          6f
  cp500                          6f
  cp875                          6f
  gb12345-raw                    22
  gb2312-raw                     22
  hz                             22
  iso-2022-kr                    1b2429435c787b31663432377d
  iso-ir-165                     22
  jis0208-raw                    20
  jis0212-raw                    22
  ksc5601-raw                    22
  posix-bc                       6f
  UCS-2BE                        fffd
  UCS-2LE                        fdff
  UTF-16                         feffd83ddc27
  UTF-16BE                       d83ddc27
  UTF-16LE                       3dd827dc
  UTF-32                         0000feff0001f427
  UTF-32BE                       0001f427
  UTF-32LE                       27f40100
  UTF-7                          2b324433634a772d
  utf-8-strict                   f09f90a7
  utf8                           f09f90a7
tux 🐧 uchar -E f09f90a7 | grep utf
  utf-8-strict                   🐧
  utf8                           🐧     (U+1F427)
tux 🐧 uchar -Fk "L WITH STROKE"
Searching for (?^u:\bL WITH STROKE\b)
000141 Ł LSTROKE_IDX     LATIN CAPITAL LETTER L WITH STROKE
         #<Multi_key> <L> <minus>
         #<Multi_key> <minus> <L>
         <Multi_key> <L> <slash>
         <Multi_key> <L> <underscore>
         <Multi_key> <slash> <L>
         <Multi_key> <underscore> <L>
000142 ł lSTROKE_IDX     LATIN SMALL LETTER L WITH STROKE
         #<Multi_key> <l> <minus>
         #<Multi_key> <minus> <l>
         <Multi_key> <l> <slash>
         <Multi_key> <l> <underscore>
         <Multi_key> <slash> <l>
         <Multi_key> <underscore> <l>
tux $ perl -CEO -wE'say "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}"'
👨🏽✈️
tux $ raku -e'"\x[1F468]\x[1F3FD]\x[200D]\x[2708]\x[FE0F]".say'
👨🏽✈️
tux $ raku -e'"\x[1F468]\x[1F3FD]\x[200D]\x[2708]\x[FE0F]".say' | xargs uchar -v
👨 U1f468 \N{MAN}
🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
 U0200d \N{ZERO WIDTH JOINER}
✈ U02708 \N{AIRPLANE}
️ U0fe0f \N{VARIATION SELECTOR-16}

Enjoy, Have FUN! H.Merijn
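
For readers who only need the per-character listing that `uchar -v` prints above, the following is a minimal sketch in core Perl. It is not taken from uchar itself: `describe_chars` is a hypothetical helper name, and the line format is only an imitation of the transcript.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();                    # charnames::viacode maps a code point to its name

binmode STDOUT, ":encoding(UTF-8)";  # emit characters as UTF-8

# describe_chars is a hypothetical helper, not part of uchar.
# It returns one "char Uxxxxx \N{NAME}" line per code point in the string.
sub describe_chars {
    my ($str) = @_;
    return map {
        my $cp   = ord $_;
        my $name = charnames::viacode($cp) // "(unnamed)";
        sprintf "%s U%05x \\N{%s}", $_, $cp, $name;
    } split //, $str;
}

# The pilot sequence from the transcript: MAN, skin tone, ZWJ, AIRPLANE, VS-16.
print "$_\n" for describe_chars("\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}");
```

The names come from the Unicode tables bundled with the running perl, so recently added characters may show as "(unnamed)" on an older perl.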

Re^2: uparse - Parse Unicode strings
by kcott (Archbishop) on Nov 18, 2023 at 13:55 UTC

    That's a very comprehensive solution with substantially more functionality than I needed. It probably deserves its own CUFP page.

    — Ken

Re^2: uparse - Parse Unicode strings
by eyepopslikeamosquito (Archbishop) on Nov 19, 2023 at 02:50 UTC

    Wow, very impressive! ... agree with kcott that it deserves its own CUFP page.

    I played briefly with your command on Ubuntu using perl v5.38:

    ~/pm/Tux$ perl -CEO -wE'say "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}"'
    👨🏽‍✈️
    
    ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F'
    👨🏽‍✈️
    

    AFAICT, the output from the perl -CEO and the bash echo -e commands above is identical, namely:

    &#128104;&#127997;&#8205;&#9992;&#65039;
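
    One way to check an "identical output" claim without depending on how a terminal or a web page renders the characters is to compare bytes. The sketch below assumes the string is encoded as UTF-8 and prints it as hex; `utf8_hex` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# utf8_hex is a hypothetical helper: the UTF-8 encoding of a string,
# written as lowercase hex digits.
sub utf8_hex {
    my ($str) = @_;
    return unpack "H*", encode("UTF-8", $str);
}

# The same five code points used in the perl and bash commands above.
my $pilot = "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}";
print utf8_hex($pilot), "\n";
```

    Piping either command through something like `od -An -tx1` should show the same byte sequence, followed by a trailing 0a for the newline.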

    Running this command produced useful output (that seems to match yours), despite the error messages:

    ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs uchar -v
    Can't exec "locate": No such file or directory at ~/pm/Tux/uchar line 103.
    👨 U1f468 \N{MAN}
    🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
    ‍ U0200d \N{ZERO WIDTH JOINER}
    ✈ U02708 \N{AIRPLANE}
    ️ U0fe0f \N{VARIATION SELECTOR-16}
    

    Using CODE blocks instead of pre:

    ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs uchar -v
    Can't exec "locate": No such file or directory at ~/pm/Tux/uchar line 103.
    &#128104; U1f468 \N{MAN}
    &#127997; U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
    &#8205; U0200d \N{ZERO WIDTH JOINER}
    &#9992; U02708 \N{AIRPLANE}
    &#65039; U0fe0f \N{VARIATION SELECTOR-16}

    👁️🍾👍🦟

      Fetch again. Now guarded. /me wonders how people work on a devel machine without mlocate :)


      Enjoy, Have FUN! H.Merijn
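
      The thread does not show how uchar guards its call to locate, so the following is only a sketch of one common approach: look for the executable in PATH before running it. `have_command` and `locate_files` are hypothetical names, not functions from uchar.

```perl
use strict;
use warnings;
use File::Spec;

# have_command is a hypothetical helper, not taken from uchar:
# true if an executable file with that name exists in a PATH directory.
sub have_command {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return 1 if -f $candidate && -x _;
    }
    return 0;
}

# locate_files is also hypothetical: run locate only when it is installed,
# and return an empty list instead of a "Can't exec" warning when it is not.
sub locate_files {
    my ($pattern) = @_;
    return () unless have_command("locate");
    open my $fh, "-|", "locate", $pattern or return ();
    chomp(my @hits = <$fh>);
    close $fh;
    return @hits;
}

printf "locate available: %s\n", have_command("locate") ? "yes" : "no";
```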

        > /me wonders how people work on a devel machine without mlocate :)

        Let me try to explain why I'd never heard of mlocate. :)

        We run the identical version of Perl with an identical set of CPAN modules on our many different Unix boxes (multiple versions of: AIX, HP-UX, Solaris, Red Hat Enterprise Linux (RHEL), Digital UNIX, Tru64 UNIX, IRIX, UnixWare, SCO Unix, ...).

        -- from Re: putting perl and modules in your source code repository

        When your typical work day for over twenty years has been spread across Windows boxes and many different Unix flavours, you naturally lean towards standard POSIX commands (such as find and xargs), rather than system-specific ones (such as locate/mlocate), because you know they're available out-of-the-box everywhere.

        Better, as indicated at Unix shell versus Perl, is to avoid a motley mix of Unix shell and Windows batch scripts by writing everything in Perl ("It's easier to port a shell than a shell script").
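
        As a small illustration of the "write it in Perl" point, here is a sketch of a by-name file search using only the core File::Find module, so it needs neither locate nor find(1). `find_by_name` is a hypothetical helper name, and unlike locate it walks the directory tree on every call rather than consulting a prebuilt index.

```perl
use strict;
use warnings;
use File::Find qw(find);

# find_by_name is a hypothetical helper: return the paths of plain files
# under @dirs whose basename is exactly $name.
sub find_by_name {
    my ($name, @dirs) = @_;
    my @found;
    # Inside the callback, $_ is the basename and $File::Find::name the full path.
    find(sub { push @found, $File::Find::name if $_ eq $name && -f $_ }, @dirs);
    return @found;
}

# Usage: perl findname.pl NAME [DIR ...]
if (@ARGV) {
    my ($name, @dirs) = @ARGV;
    print "$_\n" for find_by_name($name, @dirs ? @dirs : ".");
}
```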

        If I had a job where I spent most of my day on a Linux development machine, it would make sense to invest considerable time in mastering Linux-specific dev tools (interested to learn BTW if you get to spend most of your work day beavering away on a Linux dev box).

        Now that I know about mlocate I might get around to installing it at home on my Ubuntu VM - more likely if you, or some other kind Perl monk, sold me with some examples of how it makes development more enjoyable. :)

        Updated: minor changes to wording.

        👁️🍾👍🦟

        Thanks! Your new version is working nicely for me now.

        BTW, I found by experimenting that it seems to work fine for my simple needs even without xargs:

        ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs ./uchar -v
        👨 U1f468 \N{MAN}
        🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
        ‍ U0200d \N{ZERO WIDTH JOINER}
        ✈ U02708 \N{AIRPLANE}
        ️ U0fe0f \N{VARIATION SELECTOR-16}
        
        ~/pm/Tux$ ./uchar -v '\U1F468\U1F3FD\U200D\U2708\UFE0F'
        👨 U1f468 \N{MAN}
        🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
        ‍ U0200d \N{ZERO WIDTH JOINER}
        ✈ U02708 \N{AIRPLANE}
        ️ U0fe0f \N{VARIATION SELECTOR-16}
        

        Update: Note that \U1F3FD (&#127997;) is EMOJI MODIFIER FITZPATRICK TYPE-4: a skin color modifier character representing skin type 4 on the Fitzpatrick scale, used above to change the skin color of the airline pilot. Discipulus also used it to change the skin color of the man student emoji at Re: Emojis for Perl Monk names (Discipulus and SpaceCowboy).
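
        To see the effect of the modifier, the sketch below builds the same pilot sequence with each of the five skin tone modifiers, U+1F3FB through U+1F3FF. `pilot_with_tone` is a hypothetical helper name, and whether each line renders as a single glyph depends on the terminal and font.

```perl
use strict;
use warnings;
use charnames ();

binmode STDOUT, ":encoding(UTF-8)";

# pilot_with_tone is a hypothetical helper:
# MAN + skin tone modifier + ZERO WIDTH JOINER + AIRPLANE + VARIATION SELECTOR-16.
sub pilot_with_tone {
    my ($modifier_cp) = @_;
    return "\x{1F468}" . chr($modifier_cp) . "\x{200D}\x{2708}\x{FE0F}";
}

# U+1F3FB .. U+1F3FF are the five emoji skin tone modifiers.
for my $cp (0x1F3FB .. 0x1F3FF) {
    my $name = charnames::viacode($cp) // "(unnamed)";
    printf "%s  U+%05X %s\n", pilot_with_tone($cp), $cp, $name;
}
```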

        👁️🍾👍🦟
