dcunningham has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed monks, I have a problem with encoding of an emoji in perl. The string "Test 😀" is read from a MySQL database table which has encoding latin1. Obviously that's not a UTF-8 character set, but my console seems smart enough to detect the intended output, as a plain "select * from table" on the "mysql" client displays the "grinning face" emoji correctly.

Then my perl (version 5.26) program logs the text to a log file, and again running "tail -f" on the log file displays the emoji in the text correctly. I also log the bytes using sprintf( "%vX", $text) and it prints "54.65.73.74.20.F0.9F.98.80". So the bytes for the emoji are there, in "F0.9F.98.80".

Then the text is JSON encoded (using the JSON library) and sent using $conn->send_utf8() to a websocket client using Net::WebSocket::Server. However, the websocket client (running in a web browser) receives "Test ðŸ˜€". I've tried encode( 'UTF-8', $text ), which did not fix the problem.

The whole subject of character encoding is not an easy one, and mixing MySQL with Perl with websockets (with JSON for good measure) has made it tricky to tell where the problem is.

Can anyone help find why the websocket client doesn't receive the emoji correctly please?

Re: Encoding of emoji character
by soonix (Chancellor) on Jun 20, 2022 at 06:25 UTC
    The string "Test 😀" is read from a MySQL database table which has encoding latin1.
    IMO that's the (or at least one) source of the problem. Latin1 has no code point for 😀…

    besides that, Anonymous Monk's reply seems to be spot on.

Re: Encoding of emoji character
by Anonymous Monk on Jun 20, 2022 at 06:05 UTC
    54.65.73.74.20.F0.9F.98.80
    These bytes are UTF-8 for "Test 😀".
    "Test ðŸ˜€".
    This is an example of double-encoding: the bytes that were already UTF-8 were interpreted as if they were extended Latin-1 (to be precise, as if they were Unicode code points) and encoded into UTF-8 once again. You can see that by performing an inverse transformation:
    $ echo "Test ðŸ˜€" | iconv -t cp1252
    Test 😀
    

    (sorry, <code> tags eat Unicode...)
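
    The same round trip can be reproduced in Perl itself. This is only a sketch, assuming the nine bytes from the %vX log are exactly what the program holds in $text; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: these are the bytes the OP logged (UTF-8 for "Test " + U+1F600).
my $utf8_bytes = "Test \xF0\x9F\x98\x80";

# Double-encoding: every byte is taken as a code point (U+00F0, U+009F, ...)
# and encoded to UTF-8 a second time.
my $double = encode('UTF-8', $utf8_bytes);
printf "%vX\n", $double;    # 54.65.73.74.20.C3.B0.C2.9F.C2.98.C2.80

# Inverse transformation: one decode removes the extra layer again.
my $repaired = decode('UTF-8', $double);
printf "%vX\n", $repaired;  # 54.65.73.74.20.F0.9F.98.80
```

    Each emoji byte turns into two bytes after the second encoding, which is why the browser shows four odd characters instead of one smiley.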

    The real fix would be to check the documentation of the $conn->send_utf8() method. If it unconditionally encodes its input from Perl wide characters into UTF-8 bytes, you can decode the UTF-8 into wide characters before passing the resulting data structure to send_utf8.
      Using decode( 'UTF-8', $text ) is something I tried, but it then dies with an error:
      Wide character at /path/to/websocket_server line 680.
      Should I be doing something different to decode what's already UTF-8? Thank you.

        That "wide character" might not be the smiley, but one (actually, three) of the bytes it is encoded with.
        "F0.9F.98.80" is what sprintf( "%vX", $text) would output for e.g. $text = "\N{LATIN SMALL LETTER ETH}\x9F\x98\x80"; but for $text = "\N{GRINNING FACE}" it should instead show "1F600".

        a) What if, instead of (or in addition to) sprintf( "%vX", $text ), you try
        {
            use charnames ':full';
            use feature 'say';
            use Data::Dumper;
            for my $c ( split //, $text ) {
                # dump each character with its code point and Unicode name
                say Dumper $c, ord $c, charnames::viacode( ord $c );
            }
        }
        b) You could feed "Test \N{GRINNING FACE}" to your test program (for Perl older than 5.16, you need an explicit use charnames; for the \N escape to work).
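
        To make the difference concrete, here is a small sketch of suggestion b) that puts the two kinds of string side by side. The variable names are illustrative, and it assumes a Perl recent enough to know GRINNING FACE (the OP's 5.26 is).

```perl
use strict;
use warnings;
use charnames ':full';

my $as_bytes = "Test \xF0\x9F\x98\x80";    # UTF-8 bytes, each one its own "character"
my $as_chars = "Test \N{GRINNING FACE}";   # one real character, U+1F600

printf "%-5s %vX (length %d)\n", 'bytes', $as_bytes, length($as_bytes);
printf "%-5s %vX (length %d)\n", 'chars', $as_chars, length($as_chars);

# Name the first character after "Test " in the byte string.
print charnames::viacode( ord( substr( $as_bytes, 5, 1 ) ) ), "\n";  # LATIN SMALL LETTER ETH
```

        If your own $text produces the first kind of line, it still holds bytes and has not been decoded yet.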

        I suspect that your console output accidentally uses the same (wrong) encoding as the database input, so it looks right…

        The main problem of handling unicode is giving code the type of input it expects. Or in other words, keeping track of whether you have a string of bytes or a string of unicode characters.

        Perl's MySQL driver is smart enough to know when you are reading a column of characters or a column of bytes. In your case, you are breaking that assumption by having UTF-8 bytes stored in a column that is declared as Latin-1.

        The most correct way to fix this is to change the declared type of the database, but as you already know, this can be hard and may have lots of other consequences.

        The most correct workaround for the database being declared as the wrong type is to convert the string on the line right after the data enters your program. So, instead of fixing it inside DBI, fix it immediately after DBI.

        ...
        my $data = $dbh->selectall_arrayref($sql, { Slice => {} });
        for (@$data) {
            # This is a workaround for the incorrect
            # character set declared on the table.
            utf8::decode($_->{my_textcol});
        }
        ...

        Leave a comment for your future self or the one who comes after you. Note that I used utf8::decode. This one is built into Perl, modifies its argument in place, and returns false (and doesn't die) if it fails. This way, if someone does fix the database in the future, it won't start crashing on you here.
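
        A quick sketch of that behaviour, assuming the column delivers the same bytes the OP logged. Both strings here are made-up test values, not real database rows.

```perl
use strict;
use warnings;

# Case 1: UTF-8 bytes, as they arrive from the mis-declared column.
my $from_db = "Test \xF0\x9F\x98\x80";
my $ok = utf8::decode($from_db);            # decodes in place
printf "ok=%d %vX\n", $ok ? 1 : 0, $from_db;

# Case 2: already-decoded characters, as a fixed database might deliver them.
my $already = "Test \x{1F600}";
my $ok_again = utf8::decode($already);      # fails quietly, string untouched
printf "ok=%d %vX\n", $ok_again ? 1 : 0, $already;
```

        Both lines end in 54.65.73.74.20.1F600; only the return value differs, so the workaround is harmless once the table is declared correctly.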

        Now you will have true Unicode characters in ->{my_textcol} which you can pass around to anything that expects Unicode strings. This does NOT include any of the system print/write functions. All system input/output must be bytes, and Perl will warn or die if you give those functions wide characters. I presume that 'send_utf8' is a method that expects Unicode as input and encodes it to UTF-8 before writing it to the websocket.

        Also, note that your original diagnosis of "it works in the terminal" happened because you read the UTF8 bytes from the database, then wrote them to a terminal that expected to be seeing UTF8 bytes. If you fix your database encoding, the strings would come in as Characters and then generate warnings when you print because you are writing wide characters to STDOUT. Long story short, you just always need to keep track of whether your variables are holding characters or bytes throughout the flow of the program, and knowingly convert them to the correct form before I/O or text processing.
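
        One way to do that conversion at the output boundary is an encoding layer on the filehandle. The sketch below uses an in-memory handle as a stand-in for the log file, so $buffer and $chars are illustrative names only.

```perl
use strict;
use warnings;

my $chars = "Test \x{1F600}";   # decoded characters, not bytes

# The :encoding(UTF-8) layer turns characters into bytes on the way out,
# so print does not warn about wide characters.
open my $fh, '>:encoding(UTF-8)', \my $buffer or die "open failed: $!";
print {$fh} $chars;
close $fh or die "close failed: $!";

printf "%vX\n", $buffer;        # 54.65.73.74.20.F0.9F.98.80
```

        For a real log you would open the file path with the same layer, or call binmode on an existing handle.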

        Wide character at /path/to/websocket_server line 680.
        Should I be doing something different to decode what's already UTF-8? Thank you.
        Well, according to the documentation, wide characters are exactly the thing you're supposed to be passing to send_utf8 (the function is a one liner that sends Encode::encode('UTF-8', $_[1])), but I don't have enough information about your code to give you advice on what to try next. The error must be somewhere around /path/to/websocket_server line 680.
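
        One thing worth ruling out, purely as a guess since the code around line 680 is not shown: the JSON step may already be producing UTF-8 bytes, which an encoding send would then encode again. The sketch uses the core JSON::PP module rather than the OP's JSON module, and wire_bytes is a hypothetical stand-in for the one-liner described above, not the real method.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP ();

# Hypothetical stand-in for send_utf8: returns the bytes it would put on the wire.
sub wire_bytes { return Encode::encode('UTF-8', $_[0]) }

my $payload = { msg => "Test \x{1F600}" };               # decoded characters

my $json_chars = JSON::PP->new->encode($payload);        # character string
my $json_bytes = JSON::PP->new->utf8->encode($payload);  # already UTF-8 bytes

my $good    = wire_bytes($json_chars);   # encoded exactly once
my $doubled = wire_bytes($json_bytes);   # encoded twice

printf "%d vs %d bytes\n", length($good), length($doubled);
```

        If the program calls encode_json or sets ->utf8, its output is in the second category and would need either a non-utf8 encoder or a send that does not encode.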

        You can also try to send bytes as-is using $conn->send_binary, but that may need changes on the client side.