PerlMonks  

How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8?

by WingedKnight (Novice)
on Jun 09, 2022 at 00:26 UTC ( [id://11144531]=perlquestion )

WingedKnight has asked for the wisdom of the Perl Monks concerning the following question:

I have input strings which contain text in which some characters are in UTF-16 format and escaped with '\u'. I am trying to convert all the strings to UTF-8. For example, the string 'Alice & Bob & Carol' might be formatted in the input as:

'Alice \u0026 Bob \u0026 Carol'

To do my desired conversion, I was doing...:

$str =~ s/\\u([A-Fa-f0-9]{4})/pack("U", hex($1))/eg;

...which worked fine until I got to input strings that contained UTF-16 surrogate pairs like:

'Alice \ud83d\ude06 Bob'

How do I modify the above code that uses pack to work with UTF-16 surrogate pairs? I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.).

Replies are listed 'Best First'.
Re: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8?
by haukex (Archbishop) on Jun 09, 2022 at 08:02 UTC
    I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.).

    Why? This very much smells like an X/Y problem to me. JSON::PP has been in the core for over 11 years (since Perl 5.14).

    use warnings;
    use strict;
    use JSON::PP;
    use Data::Dump;

    my $enc = '"Alice \ud83d\ude06 Bob"';
    my $json = JSON::PP->new->allow_nonref;
    dd $json->decode($enc);   # "Alice \x{1F606} Bob"

    Minor edit for clarification.

Re: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8?
by graff (Chancellor) on Jun 09, 2022 at 03:30 UTC

    I'm just coming back from having been away from Perl for a while, and this was a very challenging problem that allowed me to learn some useful stuff -- so thank you very much for posting the question.

    It's helpful/important to point out that the wide-character escape form that you have there is based on UTF-16BE ("big-endian", aka high-byte-first) -- this is also known (as mentioned in the manual for the pack function) as "network" order.
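To make the byte-order point concrete, here is a minimal sketch (not graff's code; the variable names are made up for illustration). It assumes only core Perl and the core Encode module, and shows that pack's "n" template writes each 16-bit value high byte first, which is the layout the UTF-16BE decoder expects.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two 16-bit code units from the example escape sequence.
my @units = (0xD83D, 0xDE06);

# "n" = unsigned 16-bit integer in big-endian ("network") order.
my $bytes = pack("n*", @units);
my $hex   = unpack("H*", $bytes);          # "d83dde06"

# Decoding those four bytes as UTF-16BE yields a single character.
my $char = decode("UTF-16BE", $bytes);
printf "%s -> U+%04X\n", $hex, ord $char;  # d83dde06 -> U+1F606
```

Using "v" (little-endian) instead of "n" would produce the bytes in the wrong order for a UTF-16BE decoder.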

    I found that the following approach seems to work; note that I'm looping over the surrogate pairs first, then looping over the "normal" characters (the ones with code points less than 0x10000) -- and I print out the results at each iteration to see what's going on.

    #!/usr/bin/perl -CS
    use strict;
    use warnings;
    use Encode qw(decode);

    $_ = <DATA>;
    print;
    while (/(\\u(d8[0-9a-f]{2})\\u(d[c-f][0-9a-f]{2}))/ ) {  ## NB: Match only surrogate pairs
        my $rplc = $1;
        my $sp = pack( "nn", hex($2), hex($3) );
        s/\Q$rplc/decode( "UTF-16BE", $sp )/e;
        print;
    }
    while (/(\\u([0-9a-f]{4}))/ ) {
        my $rplc = $1;
        my $cp = pack( "n", hex($2) );
        s/\Q$rplc/decode( "UTF-16BE", $cp )/e;
        print;
    }
    __DATA__
    Ren\u00e9 \ud83d\ude06 Fran\u00e7oise

    Output:
    Ren\u00e9 \ud83d\ude06 Fran\u00e7oise
    Ren\u00e9 😆 Fran\u00e7oise
    René 😆 Fran\u00e7oise
    René 😆 Françoise
    

    IMPORTANT UPDATE: I altered the initial regex above -- the one for matching surrogate-pair escape sequences -- to ensure that the two escapes start with "d8" (for the first surrogate) and "d[c-f]" (for the second surrogate); this avoids misfiring on cases where two non-surrogate characters happen to appear next to each other in the data.

      I still think it's likely the OP is taking the wrong approach by trying to hand-roll a JSON decoder, so I did want to point out a few issues with your code - hopefully thereby also pointing out some of the pitfalls of hand-rolled approaches.

      • High surrogates range from U+D800 to U+DBFF, which your regex doesn't cover (e.g. unpack("H*", encode("UTF-16BE", "\N{VARIATION SELECTOR-256}")) eq "db40ddef").
      • Your regexes should probably also handle uppercase hex digits.
      • You might want to pass Encode::FB_CROAK to decode.
      • You don't need to loop over the strings with a regex and then a second regex, that's fairly inefficient; it can all be done in one regex.
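The points above can be sketched in a single substitution. This is only an illustration under stated assumptions, not code from any poster: the helper name decode_u_escapes is hypothetical, the pair is combined with plain arithmetic instead of Encode, a lone surrogate escape is passed through as a surrogate code point instead of raising an error, and escaped backslashes are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: expand four-digit "u" escapes, combining surrogate pairs.
sub decode_u_escapes {
    my ($str) = @_;
    $str =~ s{
        \\u([Dd][89ABab][0-9A-Fa-f]{2})   # high surrogate, D800..DBFF
        \\u([Dd][C-Fc-f][0-9A-Fa-f]{2})   # low surrogate,  DC00..DFFF
      |
        \\u([0-9A-Fa-f]{4})               # any other four-digit escape
    }{
        defined $1
            # Standard UTF-16 pair arithmetic: 10 bits from each half.
            ? chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
            : chr(hex $3)
    }gex;
    return $str;
}

my $out = decode_u_escapes('Alice \ud83d\ude06 Bob');
printf "U+%04X\n", ord substr($out, 6, 1);   # U+1F606
```

The result is a Perl character string; it still has to be encoded (for example with an output layer such as :encoding(UTF-8)) to obtain UTF-8 bytes.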

      This incorrectly handles

      { "a": "\\u2660" }
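A short sketch of why that input is a problem, assuming the data really is JSON: in JSON, two backslashes are an escaped backslash, so the value of "a" is the six literal characters backslash, u, 2, 6, 6, 0 and not U+2660. The key "b" and the variable names are added here purely for contrast.

```perl
use strict;
use warnings;
use JSON::PP;

# Single-quoted heredoc, so every backslash below is taken literally.
my $json_text = <<'EOT';
{ "a": "\\u2660", "b": "\u2660" }
EOT

# A real JSON decoder keeps the two cases apart.
my $data = JSON::PP->new->decode($json_text);
printf "a has %d characters, b is U+%04X\n",
    length $data->{a}, ord $data->{b};    # a has 6 characters, b is U+2660

# A bare substitution cannot tell them apart and rewrites both.
(my $naive = $json_text) =~ s/\\u([0-9A-Fa-f]{4})/chr(hex $1)/ge;
```

After the substitution, $naive contains a stray backslash followed by U+2660 in the value of "a", which is no longer what the producer of the data meant.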
      I'm just coming back from having been away from Perl for a while, and this was a very challenging problem that allowed me to learn some useful stuff -- so thank you very much for posting the question.
      Thank you for writing out your code. I too learned useful stuff from studying your code. :)
Re: How to Use Pack to Convert UTF-16 Surrogate Pairs to UTF-8?
by NERDVANA (Deacon) on Jun 09, 2022 at 01:01 UTC
    If you don't know the encoding of your input, a cheap hack to "fix it" is utf8::decode($string); Call it multiple times if you think the input might be multiply utf8 encoded. Strictly speaking, this is wrong, and could damage real unicode strings that happen to look like UTF8 sequences. Practically speaking, it just "fixes things" and you can get on with the rest of your work.
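A minimal sketch of both the convenience and the risk described above, using only the built-in utf8::decode (the sample strings are made up for illustration):

```perl
use strict;
use warnings;

# Five bytes: the UTF-8 encoding of "René".
my $bytes = "Ren\xC3\xA9";
my $ok1   = utf8::decode($bytes);   # true: now 4 characters, ending in U+00E9
my $len1  = length $bytes;

# A second call fails, because a lone 0xE9 byte is not well-formed UTF-8;
# the string keeps its 4 characters.
my $ok2   = utf8::decode($bytes);   # false
my $len2  = length $bytes;

# The risk: a genuine two-character string (U+00C3 U+00A9) happens to look
# like UTF-8 bytes and is silently collapsed into one character, U+00E9.
my $real  = "\x{C3}\x{A9}";
utf8::decode($real);
my $len3  = length $real;           # 1, not 2
```

Checking the return value at least tells you whether a call changed anything, but it cannot tell you whether the change was correct.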

    The only *correct* way to decode things is to know the encoding that was given to your program, then use the Encode module. BTW, the Encode module is a core perl module, and not something you should try to avoid.

    As a sidenote, I would use chr(hex $1) instead of pack("U", hex($1))
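For a single code point the two spellings give the same character string, as this small sketch shows (variable names are illustrative only):

```perl
use strict;
use warnings;

my $hex      = "00e9";                 # digits captured from an escape
my $via_pack = pack("U", hex $hex);
my $via_chr  = chr(hex $hex);

# Both are the single character U+00E9.
print "same\n" if $via_pack eq $via_chr;   # same
```

chr is simply the more direct way to say "the character with this code point".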
