Bod has asked for the wisdom of the Perl Monks concerning the following question:
I'm using AI::Chat to create an AI-powered Turkish practice chat. The first part is for the AI to analyse the Turkish supplied by the user (me) and check it for errors. Because Turkish uses some non-Latin characters in its alphabet, this has created another character encoding issue for me. To eliminate the OpenAI API and AI::Chat as possible causes, I have created this test script that demonstrates the issue... (no apologies for inline CSS, marto - this is a quick and dirty test script!)
    #!/usr/bin/perl
    use CGI::Carp qw(fatalsToBrowser);
    use lib "$ENV{'DOCUMENT_ROOT'}/cgi-bin";
    use JSON;
    use utf8;
    use incl::HTMLtest;
    use AI::Chat;
    use strict;
    use warnings;

    if ($data{'userChat'}) {
        my $reply = {};
        $reply->{'response'} = $data{'userChat'};
        print "Content-type: application/json\n\n";
        print encode_json $reply;
        exit;
    }

    print<<"END_HTML";
    Content-type: text/html; charset=UTF-8

    <html>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <head>
    <script>
    function sendChat() {
        if (document.getElementById('userChat').innerText.length > 2) {
            fetch('?userChat=' + encodeURIComponent(document.getElementById('userChat').innerText))
            .then((resp) => resp.json())
            .then((json) => {
                document.getElementById('chatBox').innerHTML += '<div class="textResponse">' + json.response + '</div>';
                document.getElementById('userChat').innerText = '';
            });
        }
    }
    </script>
    </head>
    <body>
    <div id="chatBox" style="border:solid thin blue;min-height:100px"></div>
    <div id="userChat" contenteditable="true" style="border:solid thin grey"></div>
    <input type="button" value="send" onClick="sendChat();">
    </body>
    </html>
    END_HTML
The incl::HTML module (here renamed to incl::HTMLtest) takes the URL query string and splits it into key-value pairs that it puts into %data.
In this minimalistic script, text is entered into <div id="userChat"> and sent back to the Perl script when the button is clicked. This uses the fetch API. The content is in $data{'userChat'} which is just sent back as a very simple JSON object to be written into <div id="chatBox">.
This works as expected until we introduce non-Latin characters - for example "café", which gets displayed as "cafÃ©"
I've captured the query string before decoding and it is "userChat=caf%C3%A9"
It seems very strange to me that we start off with four characters in "café" and seem to get to five with "caf%C3%A9" which gets decoded as five characters...
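A small sketch of where the fifth "character" comes from, assuming the browser turns the text into UTF-8 bytes before percent-escaping it (which is what encodeURIComponent does). In UTF-8, "é" (U+00E9) is the two bytes 0xC3 0xA9, so the four-character string is five bytes on the wire. The escaping regex below is only a rough stand-in for encodeURIComponent, good enough for this input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# "caf\x{E9}" is "café" written with an escape, so this example does not
# depend on how the source file itself is saved.
my $text  = "caf\x{E9}";              # 4 characters
my $bytes = encode('UTF-8', $text);   # 5 bytes: c a f 0xC3 0xA9

# Percent-escape every byte that is not an unreserved URL character.
# Assumption: a simplified approximation of encodeURIComponent.
(my $escaped = $bytes) =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;

printf "characters: %d\n", length $text;    # 4
printf "bytes:      %d\n", length $bytes;   # 5
print  "escaped:    $escaped\n";            # caf%C3%A9
```

So "caf%C3%A9" is correct: it is five bytes that still need to be decoded back into four characters on the Perl side.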
The code that does the decoding in incl::HTML looks like this. I cannot recall where it came from, but it has been working for many, many years and has definitely handled Turkish characters in the past under Perl v5.16.3. I wonder if it is failing after the change to Perl v5.36.0.
    my @pairs = split /&/, $query_string;
    foreach my $p(@pairs) {
        $p =~ tr/+/ /;
        $p =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        my ($key, $val) = split /=/, $p, 2;
        $data{$key} = $val;
    }
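One possible repair, offered as a sketch and not necessarily what this thread settled on: the loop above leaves raw UTF-8 bytes in %data, and encode_json then treats each byte as a character and encodes it to UTF-8 a second time. Decoding the bytes into characters straight after the percent-unescaping avoids that. The names `parse_query` and `$decode_utf8` are hypothetical, introduced only so both behaviours can be compared side by side, and JSON::PP (core) stands in for JSON.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;    # assumption: stands in for the JSON module used in the script

# Hypothetical helper wrapping the decoding loop from incl::HTML.
# With $decode_utf8 false it behaves like the original loop.
sub parse_query {
    my ($query_string, $decode_utf8) = @_;
    my %data;
    foreach my $p (split /&/, $query_string) {
        $p =~ tr/+/ /;
        $p =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;   # escapes -> raw bytes
        $p = decode('UTF-8', $p) if $decode_utf8;                   # bytes -> characters
        my ($key, $val) = split /=/, $p, 2;
        $data{$key} = $val;
    }
    return \%data;
}

my $raw     = parse_query('userChat=caf%C3%A9', 0);
my $decoded = parse_query('userChat=caf%C3%A9', 1);

# encode_json expects characters and always emits UTF-8 bytes.
my $json_raw     = encode_json({ response => $raw->{userChat} });
my $json_decoded = encode_json({ response => $decoded->{userChat} });

printf "without decode: %d chars in, %d JSON bytes out\n",
    length $raw->{userChat}, length $json_raw;
printf "with decode:    %d chars in, %d JSON bytes out\n",
    length $decoded->{userChat}, length $json_decoded;
```

Without the decode step the value is five "characters" and the JSON carries the é as four bytes (C3 83 C2 A9), which the browser renders as "cafÃ©". With it, the value is four characters and the JSON carries the expected two bytes (C3 A9).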
I am beginning to think that I will never understand this mysterious world of character encodings... then I remember that for many, many years references, especially hashrefs, were a total mystery to me and now I use them without having to think too hard about it. This is in no small part thanks to the Monastery, and I'm hoping a similar magical revelation might be bestowed on me for character encoding! Everything was so much easier when all we had was ASCII!
Re: Yet another Encoding issue...
by Danny (Chaplain) on Jun 01, 2024 at 20:17 UTC
by Bod (Parson) on Jun 01, 2024 at 20:54 UTC
by Danny (Chaplain) on Jun 01, 2024 at 21:00 UTC
by Bod (Parson) on Jun 01, 2024 at 21:48 UTC
by Danny (Chaplain) on Jun 01, 2024 at 22:02 UTC
by Danny (Chaplain) on Jun 14, 2024 at 18:13 UTC
Solved... (was: Re: Yet another Encoding issue...)
by Bod (Parson) on Jun 02, 2024 at 12:52 UTC
by Danny (Chaplain) on Jun 02, 2024 at 13:59 UTC
by Bod (Parson) on Jun 02, 2024 at 21:45 UTC
by Boldra (Curate) on Jun 04, 2024 at 12:13 UTC
by etj (Priest) on Jun 03, 2024 at 13:39 UTC
Re: Yet another Encoding issue...
by etj (Priest) on Jun 02, 2024 at 14:27 UTC