Re^2: Yet another Encoding issue...

Re^3: Yet another Encoding issue... by Danny (Chaplain) on Jun 01, 2024 at 21:00 UTC
It looks like your Ã© is the result of printing the UTF-8 encoded bytes of é. You need to print the decoded value instead. For example: `perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'` outputs: $c = Ã© $dc = é
Re^4: Yet another Encoding issue... by Bod (Parson) on Jun 01, 2024 at 21:48 UTC
That makes sense, thanks! `$reply->{'response'} = decode('UTF-8', $data{'userChat'});` seems to have done the trick on the test script... So that's one problem solved. It seems I'm now getting encoding problems from AI::Chat, but only when I call the `chat` method, not when I call the `prompt` method. That doesn't make a lot of sense, as `prompt` uses `chat`... I'll have to try to simplify the code and see if I can reproduce it!
Update: This is the sort of thing I'm getting back from AI::Chat: `{response: 'Ã\x83Â\x96zÃ\x83Â¼r dilerim, belki de sorumu yanlÃ\x84Â±Ã\x85Â\x9F s…rirken en keyif aldÃ\x84Â±Ã\x84Â\x9FÃ\x84Â±nÃ\x84Â±z Ã\x85Â\x9Fey ne?\n'}`
Another Update: `{ correction: 'Turkce alfabe oldukca turaf.\n\nThe correct sentence should be: "Türkçe alfabesi oldukça tuhaf."\n\nExplanation:\n1. The word "Türkçe" is not capitalized, it should be as it's a proper noun.\n2. The word "alfabe" is also missing its possessive suffix, it should be "alfabesi" to show that it belongs to Turkish language.\n3. The word "turaf" is not a word in Turkish. The correct word meaning "strange" or "weird" is "tuhaf".', response: 'Evet, TÃ¼rk alfabesi Latin alfabesine dayanÃ„Â±r ve 29 harften oluÃ…Âur. Her harfin belirli bir sesi temsil ettiÃ„Âini biliyor muydun?' }`
The `correction` comes from the `prompt` method and its characters display correctly, whereas the `response` comes from `chat` and is unreadable...
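One plausible reading of the unreadable `response` text is that it was UTF-8 encoded more than once somewhere along the way. The following is only a minimal sketch of how that symptom can arise with Encode, not a diagnosis of AI::Chat; `hex_bytes` is a hypothetical helper introduced here for inspection, and the sample word is an assumption chosen to match the `TÃ¼rk` seen above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# hex_bytes is a hypothetical helper (not part of Encode or AI::Chat):
# it shows each character of a string as a two-digit hex value.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

my $text  = "T\x{FC}rk";                # "Türk" as a character string
my $once  = encode('UTF-8', $text);     # correct UTF-8 bytes
my $twice = encode('UTF-8', $once);     # the bytes are treated as characters and encoded again

print 'once:  ', hex_bytes($once),  "\n";   # 54 C3 BC 72 6B
print 'twice: ', hex_bytes($twice), "\n";   # 54 C3 83 C2 BC 72 6B (a UTF-8 viewer shows "TÃ¼rk")

# Each decode removes exactly one layer of encoding.
my $undone = decode('UTF-8', $twice);
print "one decode restores the single-encoded bytes\n" if $undone eq $once;
```

Comparing hex dumps like this at each step of the real code (input, after the `chat` call, before printing) is one way to find where the extra layer is added.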
Re^5: Yet another Encoding issue... by Danny (Chaplain) on Jun 01, 2024 at 22:02 UTC
By the way, depending on what charset you specify in your HTML response, you may get problems. For example, this little CGI script:
`#!/bin/bash`
`echo "Content-Type: text/html"`
`echo ""`
`perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'`
displays as: $c = ÃƒÂ© $dc = Ã©
But if you fix the encoding like this:
`#!/bin/bash`
`echo "Content-Type: text/html; charset=UTF-8"`
`echo ""`
`perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'`
it displays as: $c = Ã© $dc = é
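The same idea can be sketched in pure Perl: keep text as characters inside the program, encode exactly once at the output boundary, and declare that encoding in the header. This is a sketch under assumptions, not the script from the thread; `build_response` is a hypothetical helper name and the sample body is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# build_response is a hypothetical helper: it takes a character string and
# returns the raw bytes of a minimal CGI response.
sub build_response {
    my ($body) = @_;
    # Declare the charset so the browser does not have to guess.
    my $header = "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    # Encode once, here, and nowhere else.
    return $header . encode('UTF-8', $body);
}

binmode STDOUT;                        # write raw bytes, no extra encoding layer
print build_response("caf\x{E9}\n");   # body is "café" as characters
```

Because the body is already encoded by hand, STDOUT is left as a plain byte stream; adding an `:encoding(UTF-8)` layer on top would encode the bytes a second time.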
Re^6: Yet another Encoding issue... by Bod (Parson) on Jun 01, 2024 at 22:06 UTC
Re^3: Yet another Encoding issue... by Danny (Chaplain) on Jun 14, 2024 at 18:13 UTC
I was curious how UTF-8 converts a sequence of bytes to a code point that isn't obviously related to the values of those bytes. With the help of UTF-8#Examples, here is how %C3%A9 (é) is converted to the code point 233. The bits for %C3 and %A9 are 11000011 and 10101001 (195 and 169). The leading bits of the first byte tell how many bytes are used for this character. In this case the leading 110 means two bytes are used (1110 would mean three bytes, and so on). For two-byte encodings, the last 5 bits of the first byte (00011) supply the higher-order bits of the code point. The leading 10 of the second byte marks it as a continuation byte. The remaining 6 bits (101001) supply the lower-order bits of the code point. So we end up with 00011 101001; `printf "%s\n", 0b00011101001` gives 233.
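The bit arithmetic above can be written out as a small Perl sketch. It handles only the two-byte case described here and is for illustration, not a replacement for `decode`; `decode_two_byte` is a hypothetical helper name.

```perl
use strict;
use warnings;

# decode_two_byte is a hypothetical helper that decodes one two-byte UTF-8
# sequence by hand, given the two byte values as integers.
sub decode_two_byte {
    my ($b1, $b2) = @_;
    # A two-byte lead byte has the form 110xxxxx.
    die "not a two-byte lead byte\n" unless ($b1 & 0b11100000) == 0b11000000;
    # A continuation byte has the form 10xxxxxx.
    die "not a continuation byte\n"  unless ($b2 & 0b11000000) == 0b10000000;
    my $high = $b1 & 0b00011111;   # low 5 bits of the lead byte
    my $low  = $b2 & 0b00111111;   # low 6 bits of the continuation byte
    return ($high << 6) | $low;    # concatenate: 5 bits followed by 6 bits
}

my $cp = decode_two_byte(0xC3, 0xA9);
printf "U+%04X (%d)\n", $cp, $cp;   # U+00E9 (233)
```

The same function applied to 0xC4 0xB1 gives 305 (U+0131, the dotless ı that appears in the Turkish text earlier in the thread).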