in reply to Yet another Encoding issue...

   It seems very strange to me that we start off with four characters in "café" and seem to get to five with "caf%C3%A9" which gets decoded as five characters...

%C3%A9 is the URL (percent) encoding of é: the two bytes C3 A9 are the UTF-8 encoding of é, each written as %XX.

perl -we '$_="caf%C3%A9"; s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg; print "$_\n"'
outputs café for me.
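
On the four-versus-five question, here is a minimal sketch (the variable names are my own, not from the original script) that separates the two steps: undoing the percent-encoding gives five bytes, and decoding those bytes as UTF-8 gives four characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $escaped = 'caf%C3%A9';

# Step 1: undo the percent-encoding. Each %XX becomes exactly one byte.
(my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

# Step 2: interpret those bytes as UTF-8 to get characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 5, characters: 4"
```

The one-liner above only does step 1, so it prints the raw bytes, and a UTF-8 terminal happens to render them as é.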

Re^2: Yet another Encoding issue...
by Bod (Parson) on Jun 01, 2024 at 20:54 UTC
    outputs café for me

    Thanks...I've just tried that over SSH on the server and I get the correct output.

    So, I suppose that means the code to decode the URI Encoded characters is working and I need to look somewhere else!

    Any suggestions why it would work at the command line but not when sent to a browser?

      It looks like your Ã© is the result of printing the encoded UTF-8 of é. You needed to print the decoded value. For example:
      perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'
      outputs: $c = Ã© $dc = é

        That makes sense thanks!

        $reply->{'response'} = decode('UTF-8', $data{'userChat'}); seems to have done the trick on the test script...
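
For anyone finding this later, the usual pattern is: decode bytes as they come in, work with characters, encode once on the way out. A rough sketch, where $raw is only a stand-in for something like $data{'userChat'} and the "Reply: " prefix is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for an incoming form value: raw UTF-8 bytes.
my $raw = "caf\xC3\xA9";

# Decode once at the input boundary: bytes become characters.
my $text = decode('UTF-8', $raw);

# Comparisons against character strings now behave as expected.
my $expected = "caf\x{E9}";    # "café" as four characters
print "raw bytes match:     ", ($raw  eq $expected ? "yes" : "no"), "\n";
print "decoded text matches: ", ($text eq $expected ? "yes" : "no"), "\n";

# Encode once at the output boundary: characters become bytes again.
my $out = encode('UTF-8', "Reply: $text");
binmode(STDOUT, ':raw');
print $out, "\n";
```

Only the decoded string compares equal to the four-character "café"; the raw bytes do not.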

        So that's one problem solved. It seems I'm now getting encoding problems from AI::Chat, but only when I call the chat method, not when I call the prompt method. But that doesn't make a lot of sense as prompt uses chat...

        I'll have to try and simplify the code and see if I can reproduce it!

        Update:

        This is the sort of thing I'm getting back from AI::Chat

        {response: 'Ã\x83Â\x96zÃ\x83¼r dilerim, belki de sorumu yanlÃ\x84±Ã\x85Â\x9F s…rirken en keyif aldÃ\x84±Ã\x84Â\x9FÃ\x84±nÃ\x84±z Ã\x85Â\x9Fey ne?\n'}

        Another Update:

        { correction: 'Turkce alfabe oldukca turaf.\n\nThe correct sentence should be: "Türkçe alfabesi oldukça tuhaf."\n\nExplanation:\n1. The word "Türkçe" is not capitalized, it should be as it's a proper noun.\n2. The word "alfabe" is also missing its possessive suffix, it should be "alfabesi" to show that it belongs to Turkish language.\n3. The word "turaf" is not a word in Turkish. The correct word meaning "strange" or "weird" is "tuhaf".', response: 'Evet, Türk alfabesi Latin alfabesine dayanır ve 29 harften oluşur. Her harfin belirli bir sesi temsil ettiğini biliyor muydun?' }

        The correction comes from the prompt method and the characters display correctly, whereas the response comes from chat and is unreadable...
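
One common way to end up with text like that is encoding an already-encoded string a second time. Whether that is what happens inside AI::Chat is not established in this thread; the sketch below only illustrates the effect and how one extra layer can be peeled off. The sample string and the hex_bytes helper are my own.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Helper (hypothetical name): show a byte string as space-separated hex.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $s;
}

my $text  = "T\x{FC}rk";               # "Türk" as characters (ü is U+00FC)
my $once  = encode('UTF-8', $text);    # correct: 5 bytes
my $twice = encode('UTF-8', $once);    # bug: the bytes are treated as characters and encoded again

print "once:  ", hex_bytes($once),  "\n";   # 54 C3 BC 72 6B
print "twice: ", hex_bytes($twice), "\n";   # 54 C3 83 C2 BC 72 6B

# Decoding the double-encoded bytes once gets back to the singly encoded form.
my $repaired = decode('UTF-8', $twice);
print "repaired\n" if $repaired eq $once;
```

If the chat path encodes (or fails to decode) at a different point than the prompt path, comparing hex dumps like this at each step is one way to find where the extra layer is added.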

      I was curious how UTF-8 converts a sequence of bytes to a code point that isn't obviously related to the values of those bytes. With the help of UTF-8#Examples, here is how %C3%A9 (é) is converted to the code point 233. The bits for %C3 and %A9 are 11000011 and 10101001 (195 and 169). The leading bits of the first byte tell how many bytes are used for this character: here the leading 110 means two bytes (1110 would mean three bytes, and so on). For a two-byte encoding, the last 5 bits of the first byte (00011) supply the high-order bits of the code point. The leading 10 of the second byte marks it as a continuation byte, and its remaining 6 bits (101001) supply the low-order bits. So we end up with 00011 101001; printf "%s\n", 0b00011101001 gives 233.
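
The same bit manipulation can be done with masks and a shift. This is only a sketch for the two-byte case (variable names are mine); real code should use Encode rather than decoding by hand.

```perl
use strict;
use warnings;

my ($b1, $b2) = (0xC3, 0xA9);    # the two bytes from %C3%A9

# 110xxxxx marks the lead byte of a two-byte sequence.
die "not a two-byte lead byte\n" unless ($b1 & 0b11100000) == 0b11000000;
# 10xxxxxx marks a continuation byte.
die "not a continuation byte\n"  unless ($b2 & 0b11000000) == 0b10000000;

my $high = $b1 & 0b00011111;     # low 5 bits of the first byte:  00011
my $low  = $b2 & 0b00111111;     # low 6 bits of the second byte: 101001
my $code_point = ($high << 6) | $low;

printf "%08b %08b -> %05b %06b -> %d (U+%04X)\n",
    $b1, $b2, $high, $low, $code_point, $code_point;
# prints "11000011 10101001 -> 00011 101001 -> 233 (U+00E9)"
```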