Thanks for the feedback. I don't have a Debian system available; I'm running Cygwin with Perlbrew and was able to wind back to v5.32.0 (the closest I have to your v5.32.1). Under that version I have Unicode::UCD 0.75 and Encode 3.06. What do you have? Here are a few tests.
$ perl -v | head -2 | tail -1
This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi
I saw the three vowels (WITH DIAERESIS) on the web page. They didn't change when I pasted them onto my command line; nor in the uparse output. However, when I pasted the results back here:
$ uparse äöü
============================================================
String: 'äöü'
============================================================
ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS
ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS
ü U+FC LATIN SMALL LETTER U WITH DIAERESIS
------------------------------------------------------------
And just so that you know what I'm seeing:
$ uparse äöü
============================================================
String: 'äöü'
============================================================
Ã U+C3 LATIN CAPITAL LETTER A WITH TILDE
¤ U+A4 CURRENCY SIGN
Ã U+C3 LATIN CAPITAL LETTER A WITH TILDE
¶ U+B6 PILCROW SIGN
Ã U+C3 LATIN CAPITAL LETTER A WITH TILDE
¼ U+BC VULGAR FRACTION ONE QUARTER
------------------------------------------------------------
There were no surprises with my other tests.
$ uparse ���
============================================================
String: '���'
============================================================
� U+FFFD REPLACEMENT CHARACTER
� U+FFFD REPLACEMENT CHARACTER
� U+FFFD REPLACEMENT CHARACTER
------------------------------------------------------------
$ uparse 👨🦳👧👦
============================================================
String: '👨🦳👧👦'
============================================================
👨 U+1F468 MAN
U+200D ZERO WIDTH JOINER
🦳 U+1F9B3 EMOJI COMPONENT WHITE HAIR
U+200D ZERO WIDTH JOINER
👧 U+1F467 GIRL
U+200D ZERO WIDTH JOINER
👦 U+1F466 BOY
------------------------------------------------------------
$ uparse 👨🏽✈️
============================================================
String: '👨🏽✈️'
============================================================
👨 U+1F468 MAN
🏽 U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
U+200D ZERO WIDTH JOINER
✈ U+2708 AIRPLANE
U+FE0F VARIATION SELECTOR-16
------------------------------------------------------------
$ uparse X🩼X
============================================================
String: 'X🩼X'
============================================================
X U+58 LATIN CAPITAL LETTER X
� U+1FA7C <unknown> Perl v5.32.0 supports Unicode 13.0.0
X U+58 LATIN CAPITAL LETTER X
------------------------------------------------------------
$ uparse `perl -C -e 'print "X\x{1fa7d}X"'`
============================================================
String: 'XX'
============================================================
X U+58 LATIN CAPITAL LETTER X
� U+1FA7D <unknown> Perl v5.32.0 supports Unicode 13.0.0
X U+58 LATIN CAPITAL LETTER X
------------------------------------------------------------
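On the two <unknown> results: those code points aren't known to the Unicode version bundled with that Perl. If you want to check which version your own perl carries, something like this should do it. Unicode::UCD::UnicodeVersion() comes from the core Unicode::UCD module; the wrapper function name and the "unknown" fallback are my own additions for the sketch.

```shell
# Hypothetical wrapper; prints the Unicode version perl was built with,
# or "unknown" if perl or the Unicode::UCD module is not available.
perl_unicode_version() {
    perl -MUnicode::UCD -e 'print Unicode::UCD::UnicodeVersion(), "\n"' 2>/dev/null \
        || echo unknown
}

perl_unicode_version
```

Comparing that number between your system and mine would at least rule the Unicode data in or out as the difference.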
You mentioned "locale setup" but didn't say what you have. I have:
LANG=en_AU.UTF-8
LC_ALL=en_AU.UTF-8
LC_COLLATE=en_AU.UTF-8
LC_CTYPE=en_AU.UTF-8
LC_MESSAGES=en_AU.UTF-8
LC_MONETARY=en_AU.UTF-8
LC_NUMERIC=en_AU.UTF-8
LC_TIME=en_AU.UTF-8
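If you want a quick check of whether your effective locale names a UTF-8 codeset, here's a rough sketch. LC_ALL overrides LC_CTYPE, which overrides LANG; the function name and the suffix patterns it accepts are my assumptions, so adjust them for your system.

```shell
# Hypothetical helper: succeeds if a locale name ends in a UTF-8 codeset.
# Only the common spellings ".UTF-8" and ".utf8" are recognised here.
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.UTF-8@*|*.utf8|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Pick the variable that actually governs character handling.
effective_locale=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}

if is_utf8_locale "$effective_locale"; then
    echo "$effective_locale: UTF-8"
else
    echo "$effective_locale: not UTF-8"
fi
```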
That's the best I can do. Perhaps someone with the same O/S and Perl version as you can shed more light on your problem.
— Ken
In reply to Re^2: uparse - Parse Unicode strings
by kcott
in thread uparse - Parse Unicode strings
by kcott