Re: Yet another Encoding issue...

Re^2: Yet another Encoding issue... by Bod (Parson) on Jun 01, 2024 at 20:54 UTC
"outputs café for me"

Thanks... I've just tried that over SSH on the server and I get the correct output. So I suppose that means the code to decode the URI-encoded characters is working and I need to look somewhere else! Any suggestions why it would work at the command line but not when sent to a browser?
Re^3: Yet another Encoding issue... by Danny (Chaplain) on Jun 01, 2024 at 21:00 UTC
It looks like your Ă© is the result of printing the UTF-8 encoded bytes of é. You need to print the decoded value instead. For example: `perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'` outputs: $c = Ă© $dc = é
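To make the bytes-versus-characters distinction concrete, here is a minimal sketch. It writes the two UTF-8 bytes of é as escapes, so the result does not depend on how the script file or the terminal is encoded. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two UTF-8 bytes for "é", as they would arrive from a URI-decoded form field.
my $bytes = "\xC3\xA9";

# decode() turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d, code point %d\n", length $chars, ord $chars;

# encode() goes the other way and restores the original two bytes.
my $round_trip = encode('UTF-8', $chars);
print "round trip matches\n" if $round_trip eq "\xC3\xA9";
```

The byte string has length 2 and the decoded string has length 1 with code point 233, which is why printing the undecoded value shows two junk characters.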
Re^4: Yet another Encoding issue... by Bod (Parson) on Jun 01, 2024 at 21:48 UTC
That makes sense, thanks! `$reply->{'response'} = decode('UTF-8', $data{'userChat'});` seems to have done the trick on the test script... So that's one problem solved.

It seems I'm now getting encoding problems from AI::Chat, but only when I call the `chat` method, not when I call the `prompt` method. That doesn't make a lot of sense, as `prompt` uses `chat`... I'll have to try and simplify the code and see if I can reproduce it!

Update: This is the sort of thing I'm getting back from AI::Chat:

`{response: 'Ă\x83Â\x96zĂ\x83ÂĽr dilerim, belki de sorumu yanlĂ\x84Â±Ă\x85Â\x9F s…rirken en keyif aldĂ\x84Â±Ă\x84Â\x9FĂ\x84Â±nĂ\x84Â±z Ă\x85Â\x9Fey ne?\n'}`

Another Update:

`{ correction: 'Turkce alfabe oldukca turaf.\n\nThe correct sentence should be: "Türkçe alfabesi oldukça tuhaf."\n\nExplanation:\n1. The word "Türkçe" is not capitalized, it should be as it's a proper noun.\n2. The word "alfabe" is also missing its possessive suffix, it should be "alfabesi" to show that it belongs to Turkish language.\n3. The word "turaf" is not a word in Turkish. The correct word meaning "strange" or "weird" is "tuhaf".', response: 'Evet, TĂĽrk alfabesi Latin alfabesine dayanĂ„Â±r ve 29 harften oluĂ…Âur. Her harfin belirli bir sesi temsil ettiĂ„Âini biliyor muydun?' }`

The `correction` comes from the `prompt` method and its characters display correctly, whereas the `response` comes from `chat` and is unreadable...
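The `\x83Â\x96` pattern in the first dump is consistent with text that has been UTF-8 encoded twice, although nothing in the thread confirms where that would happen in AI::Chat or the calling code. As a sketch of the symptom only, this encodes Ö (U+00D6) once and then again, and prints the bytes in hex so the terminal's encoding does not matter. `hex_bytes` is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each byte of a string as two hex digits.
sub hex_bytes {
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $_[0]);
}

my $char  = "\x{D6}";                  # "Ö" as a single character
my $once  = encode('UTF-8', $char);    # correct UTF-8
my $twice = encode('UTF-8', $once);    # the bytes treated as characters and encoded again

print "once:  ", hex_bytes($once),  "\n";
print "twice: ", hex_bytes($twice), "\n";

# Two decodes are needed to get back to the original character.
my $repaired = decode('UTF-8', decode('UTF-8', $twice));
printf "repaired code point: U+%04X\n", ord $repaired;
```

This prints `C3 96` for the single encoding and `C3 83 C2 96` for the double one. If that is what is happening here, the real fix would be to find and remove the extra encode rather than to decode twice.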
Re^5: Yet another Encoding issue... by Danny (Chaplain) on Jun 01, 2024 at 22:02 UTC
Re^6: Yet another Encoding issue... by Bod (Parson) on Jun 01, 2024 at 22:06 UTC
Re^3: Yet another Encoding issue... by Danny (Chaplain) on Jun 14, 2024 at 18:13 UTC
I was curious how UTF-8 converts a sequence of bytes to a code point that isn't obviously related to the values of those bytes. With the help of UTF-8#Examples, here is how %C3%A9 (é) is converted to the code point 233.

The bits for %C3 and %A9 are 11000011 and 10101001 (195 and 169). The leading bits of the first byte tell how many bytes are used for this character. In this case the leading 110 means two bytes are used (1110 would mean three bytes, and so on). For two-byte encodings the last 5 bits of the first byte (00011) become the higher-order bits of the code point. The leading 10 of the second byte marks it as a continuation byte, and the remaining 6 bits (101001) become the lower-order bits of the code point. So we end up with 00011 101001; `printf "%s\n", 0b00011101001` gives 233.
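The same steps can be written out with bit masks. This is only a sketch for the two-byte case described above: `decode_two_byte_utf8` is a made-up name, and it does not reject overlong encodings or handle one-, three- or four-byte sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one two-byte UTF-8 sequence by hand.
sub decode_two_byte_utf8 {
    my ($lead, $cont) = @_;

    # The lead byte must look like 110xxxxx, the continuation byte like 10xxxxxx.
    die "not a two-byte lead byte\n" unless ($lead & 0b11100000) == 0b11000000;
    die "not a continuation byte\n"  unless ($cont & 0b11000000) == 0b10000000;

    my $high = $lead & 0b00011111;    # low 5 bits of the lead byte
    my $low  = $cont & 0b00111111;    # low 6 bits of the continuation byte

    return ($high << 6) | $low;       # 11-bit code point
}

my $code_point = decode_two_byte_utf8(0xC3, 0xA9);
printf "U+%04X (%d)\n", $code_point, $code_point;
```

For %C3%A9 this prints `U+00E9 (233)`, matching the result worked out by hand.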