Improvement: See "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]" for an improved version of the code; mostly thanks to ++jo37 and the subthread starting with "Re: uparse - Parse Unicode strings" and continued in "Decoding @ARGV [Was: uparse - Parse Unicode strings]".
In the last month or so, we've had a number of threads where emoji were discussed. Some notable examples: "Larger profile pic than 80KB?"; "Perl Secret Operator Emojis"; and "Emojis for Perl Monk names".
Many emoji have embedded characters which are difficult, or impossible, to see; for example, zero-width joiners, variation selectors, skin tone modifiers. In some cases, glyphs are so similar that it's difficult to tell them apart; e.g. 🧑 & 👨.
I wrote uparse to split emoji, strings containing emoji, and in fact any strings with Unicode characters, into their component characters.
    #!/usr/bin/env perl

    # Bail out early on a Perl that is too old, or when no strings are given.
    BEGIN {
        if ($] < 5.007003) {
            warn "$0 requires Perl v5.7.3 or later.\n";
            exit;
        }

        unless (@ARGV) {
            warn "Usage: $0 string [string ...]\n";
            exit;
        }
    }

    use 5.007003;
    use strict;
    use warnings;
    use open IO => qw{:encoding(UTF-8) :std};

    use constant {
        SEP1 => '=' x 60 . "\n",
        SEP2 => '-' x 60 . "\n",
        FMT => "%s\tU+%-6X %s\n",
        NO_PRINT => "\N{REPLACEMENT CHARACTER}",
    };

    use Encode 'decode';
    use Unicode::UCD 'charinfo';

    for my $raw_str (@ARGV) {
        # Command-line arguments arrive as bytes; decode them to characters.
        my $str = decode('UTF-8', $raw_str);
        print "\n", SEP1;
        print "String: '$str'\n";
        print SEP1;

        # One output line per code point.
        for my $char (split //, $str) {
            my $code_point = ord $char;
            my $char_info = charinfo($code_point);

            # charinfo() returns undef for code points this Perl does not know.
            if (! defined $char_info) {
                $char_info->{name} = "<unknown> Perl $^V supports Unicode "
                    . Unicode::UCD::UnicodeVersion();
            }

            # Show U+FFFD in place of any character that is not printable.
            printf FMT,
                ($char =~ /^\p{Print}$/ ? $char : NO_PRINT),
                $code_point, $char_info->{name};
        }

        print SEP2;
    }
Here are a number of example runs. All use <pre> blocks; a very few didn't need this but I chose to go with consistency.
Works with ASCII (aka Unicode: C0 Controls and Basic Latin)
    $ uparse X XY "X	Z"

    ============================================================
    String: 'X'
    ============================================================
    X    U+58     LATIN CAPITAL LETTER X
    ------------------------------------------------------------

    ============================================================
    String: 'XY'
    ============================================================
    X    U+58     LATIN CAPITAL LETTER X
    Y    U+59     LATIN CAPITAL LETTER Y
    ------------------------------------------------------------

    ============================================================
    String: 'X	Z'
    ============================================================
    X    U+58     LATIN CAPITAL LETTER X
    �    U+9      <control>
    Z    U+5A     LATIN CAPITAL LETTER Z
    ------------------------------------------------------------
The two similar emoji heads (mentioned above)
    $ uparse 🧑 👨

    ============================================================
    String: '🧑'
    ============================================================
    🧑    U+1F9D1  ADULT
    ------------------------------------------------------------

    ============================================================
    String: '👨'
    ============================================================
    👨    U+1F468  MAN
    ------------------------------------------------------------
A complex ZWJ sequence
    $ uparse 👨🏽‍✈️

    ============================================================
    String: '👨🏽‍✈️'
    ============================================================
    👨    U+1F468  MAN
    🏽    U+1F3FD  EMOJI MODIFIER FITZPATRICK TYPE-4
    ‍    U+200D   ZERO WIDTH JOINER
    ✈    U+2708   AIRPLANE
    ️    U+FE0F   VARIATION SELECTOR-16
    ------------------------------------------------------------
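To make the invisible parts of a sequence like this easier to spot, here is a minimal sketch that labels each code point by the role it plays. The `emoji_role` helper is hypothetical and not part of uparse; it only knows the three kinds of hidden character mentioned earlier (zero-width joiner, variation selectors U+FE00 to U+FE0F, and skin tone modifiers U+1F3FB to U+1F3FF) and calls everything else a base character.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of uparse): name the role a code point
# usually plays inside an emoji sequence, using fixed code point ranges.
sub emoji_role {
    my ($code_point) = @_;
    return 'zero width joiner'  if $code_point == 0x200D;
    return 'variation selector' if $code_point >= 0xFE00  && $code_point <= 0xFE0F;
    return 'skin tone modifier' if $code_point >= 0x1F3FB && $code_point <= 0x1F3FF;
    return 'base character';
}

# The same sequence as in the run above, spelled with escapes so that
# no invisible characters hide in the source code.
my $pilot = "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}";

# Only ASCII is printed, so no output encoding layer is needed.
for my $char (split //, $pilot) {
    printf "U+%-6X %s\n", ord $char, emoji_role(ord $char);
}
```

Writing the sequence with `\x{...}` escapes also avoids having to paste the emoji into a terminal or editor that may silently drop the joiner.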
Flags
    $ uparse 🇨🇭

    ============================================================
    String: '🇨🇭'
    ============================================================
    🇨    U+1F1E8  REGIONAL INDICATOR SYMBOL LETTER C
    🇭    U+1F1ED  REGIONAL INDICATOR SYMBOL LETTER H
    ------------------------------------------------------------
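The regional indicator symbols run contiguously from U+1F1E6 (letter A) to U+1F1FF (letter Z), so the two-letter code behind a pair like the one above can be recovered with simple arithmetic. The following is only a sketch; `region_letters` is a hypothetical helper, not something uparse provides.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: turn a string of regional indicator symbols into
# the plain letters they stand for. Returns undef if any character is
# outside the range U+1F1E6 .. U+1F1FF.
sub region_letters {
    my ($str) = @_;
    my $letters = '';

    for my $char (split //, $str) {
        # U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A.
        my $offset = ord($char) - 0x1F1E6;
        return if $offset < 0 || $offset > 25;
        $letters .= chr(ord('A') + $offset);
    }

    return $letters;
}

# The pair from the run above: U+1F1E8 followed by U+1F1ED.
print region_letters("\x{1F1E8}\x{1F1ED}"), "\n";   # prints "CH"
```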
Handles code points not yet assigned, or not supported by certain Perl versions
    $ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

    ============================================================
    String: 'X🩼X'
    ============================================================
    X    U+58     LATIN CAPITAL LETTER X
    🩼    U+1FA7C  CRUTCH
    X    U+58     LATIN CAPITAL LETTER X
    ------------------------------------------------------------

    $ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

    ============================================================
    String: 'X🩼X'
    ============================================================
    X    U+58     LATIN CAPITAL LETTER X
    �    U+1FA7C  <unknown> Perl v5.30.0 supports Unicode 12.1.0
    X    U+58     LATIN CAPITAL LETTER X
    ------------------------------------------------------------

    $ uparse `perl -C -e 'print "X\x{1fa7d}X"'`

    ============================================================
    String: 'XX'
    ============================================================
    X    U+58     LATIN CAPITAL LETTER X
    �    U+1FA7D  <unknown> Perl v5.39.3 supports Unicode 15.0.0
    X    U+58     LATIN CAPITAL LETTER X
    ------------------------------------------------------------
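The fallback seen in these runs relies on `charinfo()` from Unicode::UCD returning undef for a code point that the running Perl's Unicode tables do not have assigned. Here is a small sketch of that check on its own; the `name_or_unknown` wrapper is a hypothetical name used for illustration, and the message format simply mirrors the one in the script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD qw{charinfo};

# Hypothetical wrapper: return the character name, or a note about the
# Unicode version of the running Perl when the code point is unknown.
sub name_or_unknown {
    my ($code_point) = @_;
    my $char_info = charinfo($code_point);

    return $char_info->{name} if defined $char_info;

    return "<unknown> Perl $^V supports Unicode "
        . Unicode::UCD::UnicodeVersion();
}

# U+58 is known everywhere; U+1FA7C depends on the Perl version;
# U+FFFF is a noncharacter, so it is never assigned.
for my $code_point (0x58, 0x1FA7C, 0xFFFF) {
    printf "U+%-6X %s\n", $code_point, name_or_unknown($code_point);
}
```

What the middle line prints depends on the Perl that runs it, which is the same effect the first two runs above show.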
Enjoy!
— Ken
In reply to uparse - Parse Unicode strings by kcott