How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8?

WingedKnight has asked for the wisdom of the Perl Monks concerning the following question:

I have input strings in which some characters are written as UTF-16 code units escaped with '\u'. I am trying to convert all the strings to UTF-8. For example, the string 'Alice & Bob & Carol' might appear in the input as:

'Alice \u0026 Bob \u0026 Carol'

To do my desired conversion, I was doing...:

$str =~ s/\\u([A-Fa-f0-9]{4})/pack("U", hex($1))/eg;

...which worked fine until I got to input strings that contained UTF-16 surrogate pairs like:

'Alice \ud83d\ude06 Bob'

How do I modify the above code that uses pack to work with UTF-16 surrogate pairs? I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.).


Replies are listed 'Best First'.
Re: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8? by haukex (Archbishop) on Jun 09, 2022 at 08:02 UTC
"I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.)."

Why? This very much smells like an X/Y problem to me. JSON::PP has been in the core for over 11 years (since Perl 5.14).

```perl
use warnings;
use strict;
use JSON::PP;
use Data::Dump;

my $enc  = '"Alice \ud83d\ude06 Bob"';
my $json = JSON::PP->new->allow_nonref;
dd $json->decode($enc);    # "Alice \x{1F606} Bob"
```

Minor edit for clarification.
Re: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8? by graff (Chancellor) on Jun 09, 2022 at 03:30 UTC
I'm just coming back from having been away from Perl for a while, and this was a very challenging problem that allowed me to learn some useful stuff -- so thank you very much for posting the question.

It's helpful/important to point out that the wide-character escape form that you have there is based on UTF-16BE ("big-endian", aka high-byte-first) -- this is also known (as mentioned in the manual for the `pack` function) as "network" order.

I found that the following approach seems to work; note that I'm looping over the surrogate pairs first, then looping over the "normal" characters (the ones with code points less than 0x10000) -- and I print out the results at each iteration to see what's going on.

```perl
#!/usr/bin/perl -CS
use strict;
use warnings;
use Encode qw(decode);

$_ = <DATA>;
print;
while (/(\\u(d8[0-9a-f]{2})\\u(d[c-f][0-9a-f]{2}))/ ) {  ## NB: Match only surrogate pairs
    my $rplc = $1;
    my $sp = pack( "nn", hex($2), hex($3) );
    s/\Q$rplc/decode( "UTF-16BE", $sp )/e;
    print;
}
while (/(\\u([0-9a-f]{4}))/ ) {
    my $rplc = $1;
    my $cp = pack( "n", hex($2) );
    s/\Q$rplc/decode( "UTF-16BE", $cp )/e;
    print;
}
__DATA__
Ren\u00e9 \ud83d\ude06 Fran\u00e7oise
```

Output:

```
Ren\u00e9 \ud83d\ude06 Fran\u00e7oise
Ren\u00e9 😆 Fran\u00e7oise
René 😆 Fran\u00e7oise
René 😆 Françoise
```

IMPORTANT UPDATE: I altered the initial regex above -- the one for matching surrogate-pair escape sequences -- to ensure that the two escapes ~~both start with the hex-digit "d"~~ start with "d8" (for the first surrogate) and `"d[c-f]"` (for the second surrogate); this avoids misfiring on cases where two non-surrogate characters happen to appear next to each other in the data.
Re^2: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8? (updated) by haukex (Archbishop) on Jun 09, 2022 at 12:45 UTC
I still think it's likely the OP is taking the wrong approach by trying to hand-roll a JSON decoder, so I did want to point out a few issues with your code, hopefully thereby also pointing out some of the pitfalls of hand-rolled approaches.

- High surrogates range from U+D800 to U+DBFF, which your regex doesn't cover (e.g. `unpack("H*", encode("UTF-16BE", "\N{VARIATION SELECTOR-256}")) eq "db40ddef"`).
- Your regexes should probably also handle uppercase hex digits.
- You might want to pass `Encode::FB_CROAK` to `decode`.
- You don't need to loop over the strings with a regex and then a second regex; that's fairly inefficient, and it can all be done in one regex.
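The single-regex idea mentioned above could look roughly like the following. This is only a sketch, not the code from the hidden spoiler: the helper name `decode_u_escapes` is made up for illustration, lone surrogates are passed through unvalidated, and escaped backslashes are not considered. It combines a high and low surrogate arithmetically instead of going through `Encode`; `pack("U", ...)` could be substituted for `chr(...)` if `pack` is preferred.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): decode \uXXXX escapes in one
# pass, combining UTF-16 surrogate pairs into a single code point.
sub decode_u_escapes {
    my ($str) = @_;
    $str =~ s{
        \\u([Dd][89ABab][0-9A-Fa-f]{2})   # group 1: high surrogate, D800..DBFF
        \\u([Dd][C-Fc-f][0-9A-Fa-f]{2})   # group 2: low surrogate, DC00..DFFF
      |
        \\u([0-9A-Fa-f]{4})               # group 3: any other escape
    }{
        defined $1
            # cp = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
            ? chr( 0x10000 + ( ( hex($1) - 0xD800 ) << 10 ) + ( hex($2) - 0xDC00 ) )
            : chr( hex($3) )
    }gex;
    return $str;
}

# Encode to UTF-8 only at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print decode_u_escapes('Alice \ud83d\ude06 Bob'), "\n";
```

The result is a Perl character string; the conversion to UTF-8 bytes happens when the string is written through the encoding layer on `STDOUT`.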
Re^2: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8? by ikegami (Patriarch) on Jun 10, 2022 at 04:40 UTC
This incorrectly handles `{ "a": "\\u2660" }`.
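To illustrate the case above: assuming the input follows JSON string-escaping rules, `\\u2660` is an escaped backslash followed by the plain text `u2660`, so it should not become U+2660. One way a hand-rolled substitution might cope is to consume escaped backslashes in the same pass. The helper name `decode_escapes_safely` is hypothetical, surrogate pairs are left out for brevity, and other JSON escapes (`\n`, `\"`, and so on) are still not handled, which is part of why a real JSON decoder is the safer choice.

```perl
use strict;
use warnings;

my $backslash = '\\';    # a single literal backslash

# Hypothetical helper: an escaped backslash is matched before \uXXXX, so the
# text after it is never mistaken for the start of a Unicode escape.
sub decode_escapes_safely {
    my ($str) = @_;
    $str =~ s{
        (\\\\)                   # group 1: escaped backslash
      |
        \\u([0-9A-Fa-f]{4})      # group 2: four hex digits of a Unicode escape
    }{
        defined $1 ? $backslash : chr( hex($2) )
    }gex;
    return $str;
}

my $escaped = '{ "a": "\\\\u2660" }';    # the text { "a": "\\u2660" }
my $plain   = '{ "a": "\\u2660" }';      # the text { "a": "\u2660" }

binmode STDOUT, ':encoding(UTF-8)';
print decode_escapes_safely($escaped), "\n";    # keeps u2660 as plain text
print decode_escapes_safely($plain),   "\n";    # yields the character U+2660
```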
Re^2: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8? by WingedKnight (Novice) on Jun 14, 2022 at 02:44 UTC
"I'm just coming back from having been away from Perl for a while, and this was a very challenging problem that allowed me to learn some useful stuff -- so thank you very much for posting the question."

Thank you for writing out your code. I too learned useful stuff from studying your code. :)
Re: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8? by NERDVANA (Deacon) on Jun 09, 2022 at 01:01 UTC
If you don't know the encoding of your input, a cheap hack to "fix it" is `utf8::decode($string);`. Call it multiple times if you think the input might be multiply UTF-8 encoded. Strictly speaking, this is wrong, and could damage real Unicode strings that happen to look like UTF-8 sequences. Practically speaking, it just "fixes things" and you can get on with the rest of your work.

The only correct way to decode things is to know the encoding that was given to your program, then use the Encode module. BTW, the Encode module is a core Perl module, and not something you should try to avoid.

As a sidenote, I would use `chr(hex $1)` instead of `pack("U", hex($1))`.
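A small sketch of the two points above, under the assumption that the input is known to be UTF-8 (the sample bytes and variable names are made up for illustration): decode with `Encode` at the input boundary, encode again at the output boundary, and note that `chr` and `pack("U", ...)` give the same one-character string for a given code point.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the raw UTF-8 bytes for "René" (5 bytes, 4 characters).
my $bytes = "Ren\xC3\xA9";

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes
# from being modified in place when a CHECK value is given.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
my $chars = decode( 'UTF-8', $bytes, $check );    # character string
my $out   = encode( 'UTF-8', $chars, $check );    # back to UTF-8 bytes

# chr() and pack("U", ...) both build a one-character string here.
my $via_chr  = chr( hex '1F606' );
my $via_pack = pack( 'U', hex '1F606' );

printf "%d bytes, %d characters\n", length($bytes), length($chars);
print "chr and pack agree\n" if $via_chr eq $via_pack;
```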

