Improvement: See "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]" for an improved version of the code; mostly thanks to ++jo37 and the subthread starting with "Re: uparse - Parse Unicode strings" and continued in "Decoding @ARGV [Was: uparse - Parse Unicode strings]".

In the last month or so, we've had a number of threads where emoji were discussed. Some notable examples: "Larger profile pic than 80KB?"; "Perl Secret Operator Emojis"; and "Emojis for Perl Monk names".

Many emoji have embedded characters which are difficult, or impossible, to see; for example, zero-width joiners, variation selectors, and skin tone modifiers. In some cases, glyphs are so similar that it's difficult to tell them apart; e.g. 🧑 & 👨.

I wrote uparse to split emoji, strings containing emoji, and in fact any strings with Unicode characters, into their component characters.

#!/usr/bin/env perl

BEGIN {
    if ($] < 5.007003) {
        warn "$0 requires Perl v5.7.3 or later.\n";
        exit;
    }

    unless (@ARGV) {
        warn "Usage: $0 string [string ...]\n";
        exit;
    }
}

use 5.007003;
use strict;
use warnings;
use open IO => qw{:encoding(UTF-8) :std};

use constant {
    SEP1 => '=' x 60 . "\n",
    SEP2 => '-' x 60 . "\n",
    FMT => "%s\tU+%-6X %s\n",
    NO_PRINT => "\N{REPLACEMENT CHARACTER}",
};

use Encode 'decode';
use Unicode::UCD 'charinfo';

for my $raw_str (@ARGV) {
    my $str = decode('UTF-8', $raw_str);
    print "\n", SEP1;
    print "String: '$str'\n";
    print SEP1;

    for my $char (split //, $str) {
        my $code_point = ord $char;
        my $char_info = charinfo($code_point);

        if (! defined $char_info) {
            $char_info->{name} = "<unknown> Perl $^V supports Unicode "
                . Unicode::UCD::UnicodeVersion();
        }

        printf FMT, ($char =~ /^\p{Print}$/ ? $char : NO_PRINT),
            $code_point, $char_info->{name};
    }

    print SEP2;
}
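
As a side note on the decode() call in the script: command-line arguments normally arrive as bytes, so without decoding, split // would yield one element per byte rather than one per character. The following is only a minimal sketch of that difference; lengths_before_and_after is a hypothetical helper made up for illustration and is not part of uparse.

```perl
use strict;
use warnings;
use Encode 'decode';

# Hypothetical helper (not part of uparse): report the length of a
# UTF-8 byte string before and after decoding it to characters.
sub lengths_before_and_after {
    my ($raw) = @_;
    my $decoded = decode('UTF-8', $raw);
    return (length $raw, length $decoded);
}

# The four UTF-8 bytes that encode U+1F468 (MAN).
my $raw = "\xF0\x9F\x91\xA8";
my ($byte_count, $char_count) = lengths_before_and_after($raw);

# prints "4 bytes, 1 character(s), U+1F468"
printf "%d bytes, %d character(s), U+%X\n",
    $byte_count, $char_count, ord decode('UTF-8', $raw);
```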

Here are a number of example runs. All use <pre> blocks; a few didn't need them, but I chose to go with consistency.

Works with ASCII (aka Unicode: C0 Controls and Basic Latin)

$ uparse X XY "X        Z"

============================================================
String: 'X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

============================================================
String: 'XY'
============================================================
X       U+58     LATIN CAPITAL LETTER X
Y       U+59     LATIN CAPITAL LETTER Y
------------------------------------------------------------

============================================================
String: 'X      Z'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+9      <control>
Z       U+5A     LATIN CAPITAL LETTER Z
------------------------------------------------------------

The two similar emoji heads (mentioned above)

$ uparse 🧑 👨

============================================================
String: '🧑'
============================================================
🧑      U+1F9D1  ADULT
------------------------------------------------------------

============================================================
String: '👨'
============================================================
👨      U+1F468  MAN
------------------------------------------------------------

A complex ZWJ sequence

$ uparse 👨🏽‍✈️

============================================================
String: '👨🏽‍✈️'
============================================================
👨      U+1F468  MAN
🏽      U+1F3FD  EMOJI MODIFIER FITZPATRICK TYPE-4
        U+200D   ZERO WIDTH JOINER
✈       U+2708   AIRPLANE
        U+FE0F   VARIATION SELECTOR-16
------------------------------------------------------------
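
The pilot sequence above shows the gap between what is displayed (one glyph) and what is stored (five code points). A rough sketch of that contrast follows, comparing split // with the \X regex escape, which matches an extended grapheme cluster. The helper names are hypothetical, and the number of clusters reported for a ZWJ sequence depends on the Unicode version your Perl supports, so no expected output is given.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; uparse itself splits on
# code points only.
sub code_points { my ($s) = @_; return split //, $s }
sub clusters    { my ($s) = @_; return $s =~ /(\X)/g }

# The pilot sequence from the example above, written with escapes:
# MAN, skin tone modifier, ZWJ, AIRPLANE, VARIATION SELECTOR-16.
my $pilot = "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}";

my @cps = code_points($pilot);
my @gcs = clusters($pilot);

printf "%d code points, %d grapheme cluster(s)\n",
    scalar @cps, scalar @gcs;
```

Older Perls, with older segmentation rules, may report more than one cluster for the same string.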

Flags (pairs of regional indicator symbols)

$ uparse 🇨🇭

============================================================
String: '🇨🇭'
============================================================
🇨       U+1F1E8  REGIONAL INDICATOR SYMBOL LETTER C
🇭       U+1F1ED  REGIONAL INDICATOR SYMBOL LETTER H
------------------------------------------------------------
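
The two regional indicator symbols in the output above correspond to the letters C and H. A small sketch of mapping such symbols back to plain letters is shown here; region_letters is a hypothetical helper, and it relies only on the fact that REGIONAL INDICATOR SYMBOL LETTER A through Z occupy U+1F1E6 through U+1F1FF.

```perl
use strict;
use warnings;

# Hypothetical helper: map REGIONAL INDICATOR SYMBOL LETTER A..Z
# (U+1F1E6..U+1F1FF) back to ASCII A..Z; other characters pass
# through unchanged.
sub region_letters {
    my ($str) = @_;
    return join '', map {
        my $cp = ord;
        $cp >= 0x1F1E6 && $cp <= 0x1F1FF
            ? chr($cp - 0x1F1E6 + ord 'A')
            : $_;
    } split //, $str;
}

print region_letters("\x{1F1E8}\x{1F1ED}"), "\n";    # prints "CH"
```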

Handles code points that are not yet assigned, or not supported by certain Perl versions

$ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

============================================================
String: 'X🩼X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
🩼      U+1FA7C  CRUTCH
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

============================================================
String: 'X🩼X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+1FA7C  <unknown> Perl v5.30.0 supports Unicode 12.1.0
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7d}X"'`

============================================================
String: 'X🩽X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+1FA7D  <unknown> Perl v5.39.3 supports Unicode 15.0.0
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------
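
The "<unknown>" lines in the runs above come from the fallback taken when charinfo() returns undef, which Unicode::UCD documents as its result for unassigned code points and noncharacters. A minimal standalone sketch of that fallback follows; name_or_unknown is a hypothetical helper, and U+FFFF is used because it is a noncharacter and so will never be assigned.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Hypothetical helper mirroring the fallback in uparse: return the
# character name, or a note about the Unicode version this Perl
# supports when charinfo() has nothing for the code point.
sub name_or_unknown {
    my ($code_point) = @_;
    my $info = charinfo($code_point);
    return $info->{name} if defined $info;
    return "<unknown> Perl $^V supports Unicode "
         . Unicode::UCD::UnicodeVersion();
}

printf "U+%X %s\n", $_, name_or_unknown($_) for 0x58, 0xFFFF;
```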

Enjoy!

— Ken


In reply to uparse - Parse Unicode strings by kcott
