in reply to Re: uparse - Parse Unicode strings
in thread uparse - Parse Unicode strings
Thanks for the feedback. I don't have a Debian system available; I'm running Cygwin with Perlbrew and was able to wind back to v5.32.0 (the closest I have to your v5.32.1). Under that version I have Unicode::UCD 0.75 and Encode 3.06. What do you have? Here are a few tests.
$ perl -v | head -2 | tail -1
This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi
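For anyone comparing setups, the module versions mentioned above can be printed alongside the Perl version. This is only a minimal sketch using the modules' standard `VERSION` methods and `Unicode::UCD::UnicodeVersion()`; it is not part of uparse.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD ();
use Encode ();

# Interpreter version, e.g. 5.32.0
my $perl_version = sprintf '%vd', $^V;

# Versions of the two modules being compared in this thread
my $ucd_version    = Unicode::UCD->VERSION;
my $encode_version = Encode->VERSION;

# Version of the Unicode database bundled with this perl
my $unicode_version = Unicode::UCD::UnicodeVersion();

print "Perl:         $perl_version\n";
print "Unicode::UCD: $ucd_version\n";
print "Encode:       $encode_version\n";
print "Unicode:      $unicode_version\n";
```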
I saw the three vowels (WITH DIAERESIS) on the web page. They didn't change when I pasted them onto my command line; nor in the uparse output. However, when I pasted the results back here:
$ uparse äöü
============================================================
String: 'äöü'
============================================================
ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS
ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS
ü U+FC LATIN SMALL LETTER U WITH DIAERESIS
------------------------------------------------------------
And just so that you know what I'm seeing:
$ uparse äöü
============================================================
String: 'äöü'
============================================================
à U+C3 LATIN CAPITAL LETTER A WITH TILDE
¤ U+A4 CURRENCY SIGN
à U+C3 LATIN CAPITAL LETTER A WITH TILDE
¶ U+B6 PILCROW SIGN
à U+C3 LATIN CAPITAL LETTER A WITH TILDE
¼ U+BC VULGAR FRACTION ONE QUARTER
------------------------------------------------------------
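The six code points in that second listing (C3 A4, C3 B6, C3 BC) are exactly the UTF-8 bytes of the three vowels, each read as a character of its own. A minimal sketch of that effect follows; the `describe` helper is hypothetical and this is not uparse's actual code, just an illustration of decoded versus undecoded input.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();
use Encode qw(encode decode);

# Hypothetical helper: one "U+XX NAME" line per character of the string.
sub describe {
    my ($str) = @_;
    my @lines;
    for my $ch (split //, $str) {
        my $cp   = ord $ch;
        my $name = charnames::viacode($cp) // '<unknown>';
        push @lines, sprintf 'U+%X %s', $cp, $name;
    }
    return @lines;
}

my $chars = "\x{E4}\x{F6}\x{FC}";       # the three vowels as characters
my $bytes = encode('UTF-8', $chars);    # six bytes: C3 A4 C3 B6 C3 BC

# Bytes never decoded: each byte is treated as a separate character.
my @raw = describe($bytes);

# Bytes decoded as UTF-8 first: three characters again.
my @decoded = describe(decode('UTF-8', $bytes));

print "Not decoded:\n";
print "  $_\n" for @raw;
print "Decoded:\n";
print "  $_\n" for @decoded;
```

If input arrives undecoded, the usual suspects are the `-C` switch, `PERL_UNICODE`, or an explicit `decode` on `@ARGV`; which of these applies in your case is something I can't tell from here.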
There were no surprises with my other tests.
$ uparse ���
============================================================
String: '���'
============================================================
� U+FFFD REPLACEMENT CHARACTER
� U+FFFD REPLACEMENT CHARACTER
� U+FFFD REPLACEMENT CHARACTER
------------------------------------------------------------

$ uparse 👨🦳👧👦
============================================================
String: '👨🦳👧👦'
============================================================
👨 U+1F468 MAN
  U+200D ZERO WIDTH JOINER
🦳 U+1F9B3 EMOJI COMPONENT WHITE HAIR
  U+200D ZERO WIDTH JOINER
👧 U+1F467 GIRL
  U+200D ZERO WIDTH JOINER
👦 U+1F466 BOY
------------------------------------------------------------

$ uparse 👨🏽✈️
============================================================
String: '👨🏽✈️'
============================================================
👨 U+1F468 MAN
🏽 U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
  U+200D ZERO WIDTH JOINER
✈ U+2708 AIRPLANE
  U+FE0F VARIATION SELECTOR-16
------------------------------------------------------------

$ uparse X🩼X
============================================================
String: 'X🩼X'
============================================================
X U+58 LATIN CAPITAL LETTER X
� U+1FA7C <unknown> Perl v5.32.0 supports Unicode 13.0.0
X U+58 LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7d}X"'`
============================================================
String: 'XX'
============================================================
X U+58 LATIN CAPITAL LETTER X
� U+1FA7D <unknown> Perl v5.32.0 supports Unicode 13.0.0
X U+58 LATIN CAPITAL LETTER X
------------------------------------------------------------
You mentioned "locale setup" but didn't say what you have. I have:
LANG=en_AU.UTF-8
LC_ALL=en_AU.UTF-8
LC_COLLATE=en_AU.UTF-8
LC_CTYPE=en_AU.UTF-8
LC_MESSAGES=en_AU.UTF-8
LC_MONETARY=en_AU.UTF-8
LC_NUMERIC=en_AU.UTF-8
LC_TIME=en_AU.UTF-8
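To report the equivalent information from your side, something like the following might help. It is a rough sketch using the core `POSIX` and `I18N::Langinfo` modules, and it assumes `langinfo(CODESET)` is usable on your platform; it lists the locale variables from the environment and the codeset Perl sees once the environment's locale is adopted.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt whatever locale the environment (LANG, LC_ALL, ...) specifies.
setlocale(LC_ALL, '');

# Locale-related variables as exported by the shell.
my @locale_vars = grep { /^(?:LANG|LC_)/ } sort keys %ENV;
print "$_=$ENV{$_}\n" for @locale_vars;

# What the C library reports after setlocale.
my $ctype   = setlocale(LC_CTYPE) // '(unknown)';
my $codeset = langinfo(CODESET);

print "LC_CTYPE in effect: $ctype\n";
print "Codeset:            $codeset\n";
```

A codeset other than UTF-8 there would be worth mentioning in your reply.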
That's the best I can do. Perhaps someone with the same O/S and Perl version as you can shed more light on your problem.
— Ken