Re^2: Encoding of emoji character

Re^3: Encoding of emoji character by soonix (Chancellor) on Jun 20, 2022 at 11:58 UTC
That "wide character" might not be the smiley itself, but one (actually, three) of the bytes it is encoded with. "F0.9F.98.80" is what `sprintf( "%vX", $text)` would output for e.g. `$text = "\N{LATIN SMALL LETTER ETH}\x9F\x98\x80";`, but for `$text = "\N{GRINNING FACE}"` it should show "1F600" instead.
a) What if, instead of (or in addition to) `sprintf( "%vX", $text)`, you try `{ use charnames ':full'; use feature 'say'; use Data::Dumper; for my $c ( split //, $text ) { say Dumper($c), ord $c, ' ', charnames::viacode( ord $c ); } }`
b) You could feed "Test \N{GRINNING FACE}" to your test program (for Perl older than 5.16, you need an explicit `use charnames;` for the \N escape to work).
I suspect that your console output accidentally uses the same (wrong) encoding as the database input, so it looks right…
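To make the difference between the two strings concrete, here is a minimal sketch. The variable names are mine, not from the original question; it only illustrates the `%vX` outputs described above.

```perl
use strict;
use warnings;

# Four separate characters whose code points happen to equal
# the UTF-8 bytes of the emoji (what the database seems to deliver).
my $mojibake = "\N{U+00F0}\x9F\x98\x80";

# One single character: U+1F600 GRINNING FACE.
my $emoji = "\N{U+1F600}";

# %vX prints the code point of every character in hex, joined by dots.
my $mojibake_hex = sprintf "%vX", $mojibake;
my $emoji_hex    = sprintf "%vX", $emoji;

print "$mojibake_hex (length ", length($mojibake), ")\n";
print "$emoji_hex (length ", length($emoji), ")\n";
```

If the string from the database reports a length of 4 for a single emoji, it is still bytes (or bytes misread as Latin-1 characters), not one decoded character.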
Re^4: Encoding of emoji character by choroba (Cardinal) on Jun 20, 2022 at 13:04 UTC
> for Perl older than 5.16, you need an explicit use charnames; for the \N escape to work
Thanks.
`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re^4: Encoding of emoji character by dcunningham (Sexton) on Jun 20, 2022 at 21:27 UTC
Thanks for that. If I use `$text = "Test \N{GRINNING FACE}"` in my program then the websocket client displays the emoji correctly. I modified your sample code a little and it gave the following output.
With `$text = "Test \N{GRINNING FACE}"`, the emoji gives: `$VAR1 = "\x{1f600}"; 128512 GRINNING FACE`
Using the string read from MySQL, the emoji gives: `240 LATIN SMALL LETTER ETH $VAR1 = '�'; 159 APPLICATION PROGRAM COMMAND $VAR1 = '�'; 152 START OF STRING $VAR1 = '�'; 128 PADDING CHARACTER`
Changing the database to UTF-8 will be difficult as it's not entirely under our control. Do you think the table being latin1 is the problem?
Re^5: Encoding of emoji character by soonix (Chancellor) on Jun 21, 2022 at 18:47 UTC
I do think the table being latin1 is part of the problem. On the other hand, the application that fills the table seems to use a reasonable encoding (UTF-8). If you change the table to something unicodey, that application most probably will NOT automagically insert a Unicode character instead of the current 4 bytes.
Probably the easier solution is to check for bytes between 0x80 and 0x9F (because these are not defined in ISO 8859-1, the "official" Latin1). If they are not used otherwise in your variant of Latin1, it might be feasible to try Encode::decode. What happens if you insert something like `{ use Encode qw(decode :fallbacks); $text = decode('UTF-8', $text, FB_WARN); }` after reading $text from the database?
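The check-then-decode idea could be sketched roughly as follows. `maybe_decode_utf8` is a hypothetical helper name, and the sample strings are made up; it assumes the value arrives with one character per stored byte and simply leaves the text alone when decoding is not possible.

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks);

# Hypothetical helper: decode $text as UTF-8 only if it contains something
# in the 0x80-0x9F range and the whole string is well-formed UTF-8.
# Otherwise the text is returned unchanged.
sub maybe_decode_utf8 {
    my ($text) = @_;

    # Nothing in the suspicious range: treat it as ordinary Latin-1 text.
    return $text unless $text =~ /[\x80-\x9F]/;

    # decode() with a CHECK value may modify its argument, so use a copy.
    my $copy    = $text;
    my $decoded = eval { decode('UTF-8', $copy, FB_CROAK) };

    # FB_CROAK dies on malformed input; in that case keep the original.
    return defined $decoded ? $decoded : $text;
}

my $from_db = "Test \xF0\x9F\x98\x80";   # made-up sample value
printf "%vX\n", maybe_decode_utf8($from_db);
```

Using `FB_CROAK` inside `eval` instead of `FB_WARN` is a design choice of this sketch: a string that is only partly valid UTF-8 is kept as it was rather than decoded halfway.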
Re^6: Encoding of emoji character by dcunningham (Sexton) on Jun 22, 2022 at 03:37 UTC
Re^5: Encoding of emoji character by Anonymous Monk on Jun 21, 2022 at 06:09 UTC
Interesting. None of these characters have code points above 255, and yet you sometimes get the error in `decode($text)`. You said the table encoding is latin-1. My current guess is that you get your information decoded as if it were latin-1. Most of it looks like bytes, but occasionally the latin-1 text decodes to wide characters and blows up `decode` (which only expects bytes). What if you encode `$text` back to latin-1 to get bytes, then decode those as UTF-8? This transformation should be reversible as long as all bytes round-trip, that is, as long as MySQL's interpretation of "latin-1" is the same as Perl's and assigns a meaning to all 256 possible byte values.
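A minimal sketch of that round trip, assuming the string really consists of characters in the range U+0000 to U+00FF. The helper name `latin1_roundtrip_to_utf8` and the sample input are mine.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: undo "UTF-8 bytes that were read as Latin-1 characters".
sub latin1_roundtrip_to_utf8 {
    my ($text) = @_;

    # Die on anything that does not map, and leave the input untouched.
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # Characters -> original bytes. Dies if a character is above U+00FF.
    my $bytes = encode('ISO-8859-1', $text, $check);

    # Bytes -> real Unicode characters. Dies on malformed UTF-8.
    return decode('UTF-8', $bytes, $check);
}

my $fixed = latin1_roundtrip_to_utf8("Test \x{F0}\x{9F}\x{98}\x{80}");
printf "%vX\n", $fixed;
```

MySQL documents its latin1 character set as cp1252-based, so if `encode` dies on characters such as U+20AC, trying `'cp1252'` in place of `'ISO-8859-1'` may be worth a test. Whether that round-trips every byte in your data is an assumption to verify, not something this sketch guarantees.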
Re^6: Encoding of emoji character by soonix (Chancellor) on Jun 21, 2022 at 18:52 UTC
Re^6: Encoding of emoji character by dcunningham (Sexton) on Jun 22, 2022 at 03:40 UTC
Re^3: Encoding of emoji character by NERDVANA (Priest) on Jun 22, 2022 at 18:53 UTC
The main problem of handling Unicode is giving code the type of input it expects. Or in other words, keeping track of whether you have a string of bytes or a string of Unicode characters. Perl's MySQL driver is smart enough to know when you are reading a column of characters or a column of bytes. In your case, you are breaking that assumption by having UTF-8 bytes stored in a column that is declared as Latin-1.
The most correct way to fix this is to change the declared type of the database, but as you already know, this can be hard and may have lots of other consequences. The most correct workaround for the database being declared as the wrong type is to convert the string on the line right after the data enters your program. So, instead of fixing it inside DBI, fix it immediately after DBI.

    ...
    my $data = $dbh->selectall_arrayref($sql, { Slice => {} });
    for (@$data) {
        # This is a workaround for the incorrect
        # character set declared on the table.
        utf8::decode($_->{my_textcol});
    }
    ...

Leave a comment for your future self or the one who comes after you. Note that I used utf8::decode. This one is built into perl, modifies its argument, and returns false (and doesn't die) if it fails. This way, if someone does fix the database in the future, it won't start crashing on you here.
Now you will have true Unicode characters in ->{my_textcol} which you can pass around to anything that expects Unicode strings. This does NOT include any of the system print/write functions. All system input/output must be bytes and will complain or die if you give them wide characters. I presume that 'send_utf8' is a method that expects Unicode as input and encodes it to UTF-8 before writing it to the websocket.
Also, note that your original diagnosis of "it works in the terminal" happened because you read the UTF-8 bytes from the database, then wrote them to a terminal that expected to see UTF-8 bytes. If you fix your database encoding, the strings would come in as characters and then generate warnings when you `print`, because you would be writing wide characters to STDOUT.
Long story short, you always need to keep track of whether your variables are holding characters or bytes throughout the flow of the program, and knowingly convert them to the correct form before I/O or text processing.
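Here is a small self-contained sketch of that "decode right after input, encode right before output" flow. It uses an in-memory array as a stand-in for the DBI result, so the rows and their contents are made up; `my_textcol` is the placeholder column name used above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for rows returned by DBI (sample data, not from the real table).
my @rows = (
    { my_textcol => "Test \xF0\x9F\x98\x80" },  # UTF-8 bytes for U+1F600
    { my_textcol => "caf\xE9" },                # Latin-1 byte, not valid UTF-8
    { my_textcol => "Test \x{1F600}" },         # already decoded characters
);

# Input boundary: bytes -> characters, in place.
# utf8::decode returns false and leaves the value alone when it cannot decode.
my @decoded_ok = map { utf8::decode($_->{my_textcol}) ? 1 : 0 } @rows;

# Output boundary: characters -> UTF-8 bytes, right before writing.
my $out = encode('UTF-8', $rows[0]{my_textcol});

print "decoded flags: @decoded_ok\n";
print "characters: ", length($rows[0]{my_textcol}), ", bytes: ", length($out), "\n";
```

The third row shows why the workaround is fairly safe to leave in place: a value that already contains wide characters is not valid input for `utf8::decode`, so it returns false and leaves the value as it is.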
Re^3: Encoding of emoji character by Anonymous Monk on Jun 20, 2022 at 09:50 UTC
> `Wide character at /path/to/websocket_server line 680.`
> Should I be doing something different to decode what's already UTF-8? Thank you.
Well, according to the documentation, wide characters are exactly the thing you're supposed to be passing to `send_utf8` (the function is a one-liner that sends `Encode::encode('UTF-8', $_[1])`), but I don't have enough information about your code to give you advice on what to try next. The error must be somewhere around /path/to/websocket_server line 680. You can also try to send the bytes as-is using `$conn->send_binary`, but that may need changes on the client side.
Re^4: Encoding of emoji character by dcunningham (Sexton) on Jun 20, 2022 at 21:32 UTC
The line giving the "Wide character" error is the actual decode() line itself. Perhaps $text isn't actually in UTF-8 already? But then I don't know what encoding it is in.