
outputs café for me

Thanks...I've just tried that over SSH on the server and I get the correct output.

So, I suppose that means the code to decode the URI Encoded characters is working and I need to look somewhere else!

Any suggestions why it would work at the command line but not when sent to a browser?

Re^3: Yet another Encoding issue...
by Danny (Chaplain) on Jun 01, 2024 at 21:00 UTC
    It looks like your Ã© is the result of printing the encoded UTF-8 of é. You need to print the decoded value instead. For example:
    perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'
    outputs: $c = Ã© $dc = é
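To make the difference between the encoded and decoded values visible without depending on a terminal's settings, here is a minimal sketch that prints code points in hex instead of the characters themselves. It uses only the core Encode module; `hexdump` is a hypothetical helper written for this illustration, and the character is written as `\x{E9}` so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show each element of a string as two hex digits.
sub hexdump { join ' ', map { sprintf '%02X', ord } split //, shift }

my $chars = "\x{E9}";                  # one character: é (U+00E9)
my $bytes = encode('UTF-8', $chars);   # two bytes: C3 A9
my $back  = decode('UTF-8', $bytes);   # one character again

printf "chars: %s\n", hexdump($chars);
printf "bytes: %s\n", hexdump($bytes);
printf "back:  %s\n", hexdump($back);
```

The encoded form has two elements where the decoded form has one, which is why printing the encoded value as if it were text shows two characters instead of é.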

      That makes sense thanks!

      $reply->{'response'} = decode('UTF-8', $data{'userChat'}); seems to have done the trick on the test script...

      So that's one problem solved. It seems I'm now getting encoding problems from AI::Chat, but only when I call the chat method, not when I call the prompt method. But that doesn't make a lot of sense as prompt uses chat...

      I'll have to try and simplify the code and see if I can reproduce it!

      Update:

      This is the sort of thing I'm getting back from AI::Chat

      {response: 'Ã\x83Â\x96zÃ\x83¼r dilerim, belki de sorumu yanlÃ\x84±Ã\x85Â\x9F s…rirken en keyif aldÃ\x84±Ã\x84Â\x9FÃ\x84±nÃ\x84±z Ã\x85Â\x9Fey ne?\n'}

      Another Update:

      {
        correction: 'Turkce alfabe oldukca turaf.\n\nThe correct sentence should be: "Türkçe alfabesi oldukça tuhaf."\n\nExplanation:\n1. The word "Türkçe" is not capitalized, it should be as it's a proper noun.\n2. The word "alfabe" is also missing its possessive suffix, it should be "alfabesi" to show that it belongs to Turkish language.\n3. The word "turaf" is not a word in Turkish. The correct word meaning "strange" or "weird" is "tuhaf".',
        response: 'Evet, Türk alfabesi Latin alfabesine dayanır ve 29 harften oluşur. Her harfin belirli bir sesi temsil ettiğini biliyor muydun?'
      }

      The correction comes from the prompt method and its characters display correctly, whereas the response comes from chat and is unreadable...
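Sequences such as `Ã\x83Â\x96` are what a character like Ö looks like after it has been UTF-8 encoded more than once, with the bytes misread as Latin-1 characters in between. The sketch below reproduces that pattern and reverses it. It is only an illustration under that assumption: `undo_one_layer` is a hypothetical helper, not part of AI::Chat or Encode, and nothing here shows where in the real code the extra encoding happens.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: undo one extra layer of UTF-8 encoding.
# Assumes the string holds UTF-8 bytes that were misread as Latin-1
# characters; it dies if that assumption does not hold.
sub undo_one_layer {
    my ($text) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $bytes = encode('ISO-8859-1', $text, $check);
    return decode('UTF-8', $bytes, $check);
}

my $original = "\x{D6}z\x{FC}r";       # "Özür" as a character string

# Simulate the damage: encode to UTF-8, misread the bytes as Latin-1, twice.
my $mangled = $original;
$mangled = decode('ISO-8859-1', encode('UTF-8', $mangled)) for 1 .. 2;

# The first four characters are now C3 83 C2 96, i.e. "Ã\x83Â\x96".
my $fixed = undo_one_layer(undo_one_layer($mangled));
print $fixed eq $original ? "round trip ok\n" : "mismatch\n";
```

Repairing the text after the fact like this is a workaround. Finding the place where an already encoded string gets encoded again is the more reliable fix.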

        By the way, depending on what charset you are specifying in your html you may get problems. For example, the little CGI script:
        #!/bin/bash
        echo "Content-Type: text/html"
        echo ""
        perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'
        displays as: $c = ÃƒÂ© $dc = Ã©

        But if you fix the encoding like:

        #!/bin/bash
        echo "Content-Type: text/html; charset=UTF-8"
        echo ""
        perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'
        it displays as: $c = Ã© $dc = é
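The same idea can be written as a plain Perl CGI script. This is a minimal sketch, assuming the body is held as a decoded character string and is encoded exactly once on the way out; `build_response` is a hypothetical helper introduced for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build the raw bytes of a complete CGI response.
sub build_response {
    my ($body) = @_;                       # $body is a character string
    my $header = "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    return $header . encode('UTF-8', $body);   # encode once, at the edge
}

binmode STDOUT;                            # write the bytes untouched
print build_response("caf\x{E9}\n");
```

The charset in the header and the encoding applied to the body have to agree. If the header says UTF-8 but the body is encoded twice, or not at all, the browser shows the kind of output seen earlier in the thread.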
Re^3: Yet another Encoding issue...
by Danny (Chaplain) on Jun 14, 2024 at 18:13 UTC
    I was curious how UTF-8 converts a sequence of bytes to a code point that isn't obviously related to the values of those bytes. With the help of UTF-8#Examples, here is how %C3%A9 (é) is converted to the code point 233. The bits for %C3 and %A9 are 11000011 and 10101001 (195 and 169). The leading bits of the first byte tell how many bytes are used for this character. In this case the leading 110 means two bytes are used (1110 would mean three bytes, and so on). For two-byte encodings the last 5 bits of the first byte (00011) supply the higher-order bits of the code point. The leading 10 of the second byte marks it as a continuation byte, and the remaining 6 bits (101001) supply the lower-order bits. So we end up with 00011 101001; printf "%s\n", 0b00011101001 gives 233.
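The bit arithmetic above can be written out as a short function. This is a sketch for the two-byte case only; `decode_two_byte` is a hypothetical name, and real code should use Encode's decode instead, since this version does not reject overlong encodings.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one two-byte UTF-8 sequence by hand.
sub decode_two_byte {
    my ($lead, $cont) = @_;
    die "not a two-byte lead byte\n"
        unless ($lead & 0b11100000) == 0b11000000;   # must start with 110
    die "not a continuation byte\n"
        unless ($cont & 0b11000000) == 0b10000000;   # must start with 10
    my $high = $lead & 0b00011111;    # low 5 bits of the lead byte
    my $low  = $cont & 0b00111111;    # low 6 bits of the continuation byte
    return ($high << 6) | $low;       # 11 bits in total
}

printf "%d\n", decode_two_byte(0xC3, 0xA9);   # prints 233
```

Shifting the 5 high bits left by 6 makes room for the 6 low bits, which is the same as writing the two groups side by side as 00011 101001.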