
Improvement: See "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]" for an improved version of the code; mostly thanks to ++jo37 and the subthread starting with "Re: uparse - Parse Unicode strings" and continued in "Decoding @ARGV [Was: uparse - Parse Unicode strings]".

In the last month or so, we've had a number of threads where emoji were discussed. Some notable examples: "Larger profile pic than 80KB?"; "Perl Secret Operator Emojis"; and "Emojis for Perl Monk names".

Many emoji have embedded characters which are difficult, or impossible, to see; for example, zero-width joiners, variation selectors, and skin tone modifiers. In some cases, glyphs are so similar that it's difficult to tell them apart; e.g. 🧑 & 👨.

I wrote uparse to split emoji, strings containing emoji, and in fact any strings with Unicode characters, into their component characters.

#!/usr/bin/env perl

BEGIN {
    if ($] < 5.007003) {
        warn "$0 requires Perl v5.7.3 or later.\n";
        exit;
    }

    unless (@ARGV) {
        warn "Usage: $0 string [string ...]\n";
        exit;
    }
}

use 5.007003;
use strict;
use warnings;
use open IO => qw{:encoding(UTF-8) :std};
use constant {
    SEP1 => '=' x 60 . "\n",
    SEP2 => '-' x 60 . "\n",
    FMT => "%s\tU+%-6X %s\n",
    NO_PRINT => "\N{REPLACEMENT CHARACTER}",
};

use Encode 'decode';
use Unicode::UCD 'charinfo';

for my $raw_str (@ARGV) {
    my $str = decode('UTF-8', $raw_str);
    print "\n", SEP1;
    print "String: '$str'\n";
    print SEP1;

    for my $char (split //, $str) {
        my $code_point = ord $char;
        my $char_info = charinfo($code_point);

        if (! defined $char_info) {
            $char_info->{name} = "<unknown> Perl $^V supports Unicode "
                               . Unicode::UCD::UnicodeVersion();
        }

        printf FMT, ($char =~ /^\p{Print}$/ ? $char : NO_PRINT),
                    $code_point, $char_info->{name};
    }

    print SEP2;
}
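The decode() call near the top of the loop matters because command-line arguments normally arrive as UTF-8 octets, not characters. The following is only a minimal sketch of that difference, assuming a UTF-8 terminal; the helper name octets_and_chars is hypothetical and not part of uparse.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw{decode encode};

# Hypothetical helper: report how long a raw argument is before and
# after decoding it as UTF-8.
sub octets_and_chars {
    my ($raw_str) = @_;
    my $str = decode('UTF-8', $raw_str);
    return (length $raw_str, length $str);
}

# Simulate what a shell would typically pass for U+1F468 (MAN).
my $man = "\x{1F468}";
my $raw = encode('UTF-8', $man);

my ($octets, $chars) = octets_and_chars($raw);
printf "octets=%d chars=%d\n", $octets, $chars;
```

Without the decode step, split // would walk the individual octets rather than the code points.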

Here are a number of example runs. All use <pre> blocks; a very few didn't need this, but I chose to go with consistency.

Works with ASCII (aka Unicode: C0 Controls and Basic Latin)

$ uparse X XY "X        Z"

============================================================
String: 'X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

============================================================
String: 'XY'
============================================================
X       U+58     LATIN CAPITAL LETTER X
Y       U+59     LATIN CAPITAL LETTER Y
------------------------------------------------------------

============================================================
String: 'X      Z'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+9      <control>
Z       U+5A     LATIN CAPITAL LETTER Z
------------------------------------------------------------
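The tab in the last run shows up as the replacement character because it fails the \p{Print} test in the printf line. Here is a small sketch of that check in isolation; display_char is a hypothetical helper name, and U+FFFD is written as an escape rather than by name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper mirroring the \p{Print} test used in uparse:
# return the character itself if printable, otherwise U+FFFD.
sub display_char {
    my ($char) = @_;
    return $char =~ /^\p{Print}$/ ? $char : "\x{FFFD}";
}

# Report which characters are kept and which are swapped out.
for my $char ("X", "\t", " ") {
    printf "U+%-6X %s\n", ord $char,
        display_char($char) eq $char ? 'shown as-is' : 'replaced';
}
```

A tab is a control character, so it is replaced, while an ordinary space counts as printable and is kept.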

The two similar emoji heads (mentioned above)

$ uparse 🧑 👨

============================================================
String: '🧑'
============================================================
🧑      U+1F9D1  ADULT
------------------------------------------------------------

============================================================
String: '👨'
============================================================
👨      U+1F468  MAN
------------------------------------------------------------

A complex ZWJ sequence

$ uparse 👨🏽‍✈️

============================================================
String: '👨🏽‍✈️'
============================================================
👨      U+1F468  MAN
🏽      U+1F3FD  EMOJI MODIFIER FITZPATRICK TYPE-4
        U+200D   ZERO WIDTH JOINER
✈       U+2708   AIRPLANE
        U+FE0F   VARIATION SELECTOR-16
------------------------------------------------------------
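For comparison, here is a sketch that counts the same sequence two ways: by code point, as uparse does, and by extended grapheme cluster using \X. This is an illustration only. How many clusters \X reports depends on the Unicode version your Perl supports, so I make no claim about the second number on any particular installation.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The sequence from the run above, written with escapes:
# MAN, skin tone modifier, ZWJ, AIRPLANE, VARIATION SELECTOR-16.
my $str = "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}";

# One element per code point (what uparse prints, one per line).
my @code_points = split //, $str;

# One element per extended grapheme cluster, as this Perl defines it.
my @clusters = $str =~ /\X/g;

printf "code points: %d, grapheme clusters: %d\n",
    scalar @code_points, scalar @clusters;
```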

Flags

$ uparse 🇨🇭

============================================================
String: '🇨🇭'
============================================================
🇨       U+1F1E8  REGIONAL INDICATOR SYMBOL LETTER C
🇭       U+1F1ED  REGIONAL INDICATOR SYMBOL LETTER H
------------------------------------------------------------
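The regional indicator symbols run from U+1F1E6 (letter A) to U+1F1FF (letter Z), so a pair like the one above can be mapped back to plain letters with simple arithmetic. A minimal sketch, with a hypothetical helper name:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use constant {
    RI_A => 0x1F1E6,    # REGIONAL INDICATOR SYMBOL LETTER A
    RI_Z => 0x1F1FF,    # REGIONAL INDICATOR SYMBOL LETTER Z
};

# Hypothetical helper: turn each regional indicator into the ASCII
# letter it stands for; leave every other character untouched.
sub regional_indicators_to_letters {
    my ($str) = @_;
    return join '', map {
        my $cp = ord $_;
        ($cp >= RI_A && $cp <= RI_Z) ? chr($cp - RI_A + ord('A')) : $_;
    } split //, $str;
}

# U+1F1E8 U+1F1ED is the pair shown in the run above.
print regional_indicators_to_letters("\x{1F1E8}\x{1F1ED}"), "\n";    # prints "CH"
```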

Handles codepoints not yet assigned; or not supported with certain Perl versions

$ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

============================================================
String: 'X🩼X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
🩼      U+1FA7C  CRUTCH
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

============================================================
String: 'X🩼X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+1FA7C  <unknown> Perl v5.30.0 supports Unicode 12.1.0
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7d}X"'`

============================================================
String: 'X🩽X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+1FA7D  <unknown> Perl v5.39.3 supports Unicode 15.0.0
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------
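The "<unknown>" lines come from charinfo() returning undef for code points that the running Perl's Unicode database does not have assigned. The sketch below isolates that fallback; name_or_unknown is a hypothetical helper, and U+FFFF is used as the example because it is a noncharacter and so will never be assigned.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Hypothetical helper: return the character name, or a note giving
# the Unicode version this Perl supports when charinfo() has no entry.
sub name_or_unknown {
    my ($code_point) = @_;
    my $char_info = charinfo($code_point);
    return $char_info->{name} if defined $char_info;
    return "<unknown> Perl $^V supports Unicode "
         . Unicode::UCD::UnicodeVersion();
}

# One assigned code point and one that never will be.
printf "U+%-6X %s\n", $_, name_or_unknown($_) for 0x58, 0xFFFF;
```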

Enjoy!

— Ken

In reply to uparse - Parse Unicode strings by kcott
