I'm using AI::Chat to create an AI-powered chat for practising Turkish. The first part is for the AI to analyse the Turkish supplied by the user (me) and check it for errors. Because the Turkish alphabet includes some non-ASCII characters, this has created another character encoding issue for me. To rule out the OpenAI API and AI::Chat as the cause, I have created this test script that demonstrates the issue...(no apologies for the inline CSS, marto - this is a quick and dirty test script!)
#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use lib "$ENV{'DOCUMENT_ROOT'}/cgi-bin";
use JSON;
use utf8;
use incl::HTMLtest;
use AI::Chat;
use strict;
use warnings;

if ($data{'userChat'}) {
    my $reply = {};
    $reply->{'response'} = $data{'userChat'};
    print "Content-type: application/json\n\n";
    print encode_json $reply;
    exit;
}

print<<"END_HTML";
Content-type: text/html; charset=UTF-8

<html>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1" />
<head>
<script>
function sendChat() {
    if (document.getElementById('userChat').innerText.length > 2) {
        fetch('?userChat=' + encodeURIComponent(document.getElementById('userChat').innerText))
        .then((resp) => resp.json())
        .then((json) => {
            document.getElementById('chatBox').innerHTML += '<div class="textResponse">' + json.response + '</div>';
            document.getElementById('userChat').innerText = '';
        });
    }
}
</script>
</head>
<body>
<div id="chatBox" style="border:solid thin blue;min-height:100px"></div>
<div id="userChat" contenteditable="true" style="border:solid thin grey"></div>
<input type="button" value="send" onClick="sendChat();">
</body>
</html>
END_HTML
The incl::HTML module (here renamed to incl::HTMLtest) takes the URL query string and splits it into key/value pairs that it puts into %data.
In this minimalistic script, text is entered into <div id="userChat"> and sent back to the Perl script when the button is clicked. This uses the fetch API. The content is in $data{'userChat'} which is just sent back as a very simple JSON object to be written into <div id="chatBox">.
This works as expected until we introduce non-ASCII characters - for example "café", which gets displayed as "cafĂ©".
I've captured the query string before decoding and it is "userChat=caf%C3%A9"
It seems very strange to me that we start off with four characters in "café" and seem to get to five with "caf%C3%A9" which gets decoded as five characters...
The code that does the decoding in incl::HTML looks like this. I cannot recall where it came from but it has been working for many, many years and has definitely handled Turkish characters in the past under Perl v5.16.3. I wonder if it is failing after the change to Perl v5.36.0
my @pairs = split /&/, $query_string;
foreach my $p (@pairs) {
    $p =~ tr/+/ /;
    $p =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    my ($key, $val) = split /=/, $p, 2;
    $data{$key} = $val;
}
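To make the four-versus-five puzzle concrete, here is a minimal standalone sketch of what that substitution produces for "caf%C3%A9". The `percent_decode` sub is a hypothetical wrapper around the same two lines, not part of incl::HTML, and the assumption is that the browser sent the text percent-encoded as UTF-8 (which is what `encodeURIComponent` does). The percent-decoding yields five bytes, because é is two bytes in UTF-8; decoding those bytes with the core Encode module gives back four characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the same percent-decoding as in incl::HTML,
# wrapped in a sub so it can be run on its own.
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    return $s;
}

# After percent-decoding we hold raw UTF-8 bytes: c, a, f, 0xC3, 0xA9.
my $bytes = percent_decode('caf%C3%A9');

# Decoding the bytes as UTF-8 turns 0xC3 0xA9 into the single character U+00E9.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 5, characters: 4"
```

If this is what is happening in the test script, then `$data{'userChat'}` holds the undecoded bytes, and `encode_json`, which expects character strings and emits UTF-8, would encode each of the two bytes of é a second time. Storing the decoded value in `%data` instead may be enough to make "café" round-trip, though that is a sketch under the assumptions above rather than a tested fix for incl::HTML.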
I am beginning to think that I will never understand this mysterious world of character encodings...then I remember that for many, many years references, especially hashrefs, were a total mystery to me and now I use them without having to think too hard about it. This is in no small part thanks to the Monastery and I'm hoping a similar magical revelation might be bestowed on me for character encoding! Everything was so much easier when all we had was ASCII!
In reply to Yet another Encoding issue... by Bod