I'm using AI::Chat to create an AI-powered chat for practising Turkish. The first part is for the AI to analyse the Turkish supplied by the user (me) and check it for errors. Because Turkish uses some non-ASCII characters in its alphabet, this has created another character encoding issue for me. To rule out the OpenAI API and AI::Chat as the cause, I have created this test script that demonstrates the issue... (no apologies for inline CSS marto - this is a quick and dirty test script!)

#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use lib "$ENV{'DOCUMENT_ROOT'}/cgi-bin";

use JSON;
use utf8;
use incl::HTMLtest;
use AI::Chat;

use strict;
use warnings;

if ($data{'userChat'}) {
    my $reply = {};
    $reply->{'response'} = $data{'userChat'};
    print "Content-type: application/json\n\n";
    print encode_json $reply;
    exit;
}

print<<"END_HTML";
Content-type: text/html; charset=UTF-8

<html>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1" />
<head>
<script>
function sendChat() {
    if (document.getElementById('userChat').innerText.length > 2) {
        fetch('?userChat=' + encodeURIComponent(document.getElementById('userChat').innerText))
        .then((resp) => resp.json())
        .then((json) => {
            document.getElementById('chatBox').innerHTML += '<div class="textResponse">' + json.response + '</div>';
            document.getElementById('userChat').innerText = '';
        });
    }
}
</script>
</head>
<body>
<div id="chatBox" style="border:solid thin blue;min-height:100px"></div>
<div id="userChat" contenteditable="true" style="border:solid thin grey"></div>
<input type="button" value="send" onClick="sendChat();">
</body>
</html>
END_HTML

The incl::HTML module (here renamed to incl::HTMLtest) takes the URL query string and splits it into key/value pairs, which it puts into %data.

In this minimalistic script, text is entered into <div id="userChat"> and sent back to the Perl script when the button is clicked. This uses the fetch API. The content is in $data{'userChat'} which is just sent back as a very simple JSON object to be written into <div id="chatBox">.

This works as expected until we introduce non-ASCII characters, for example "café", which gets displayed as "cafĂ©".
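One possible explanation, offered as a sketch rather than a diagnosis: encode_json expects a string of characters and UTF-8 encodes it on output, so a value that already holds UTF-8 bytes would be encoded a second time. The snippet below uses the core JSON::PP module on the assumption that its encode_json behaves like the one from JSON here, and it hard-codes the bytes that the query string decoding appears to produce.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json);
use Encode qw(decode);

# Assumption: this is what ends up in $data{'userChat'} (raw UTF-8 bytes)
my $bytes = "caf\xC3\xA9";
# The same text as a 4-character string
my $chars = decode('UTF-8', $bytes);

my $from_bytes = encode_json({ response => $bytes });
my $from_chars = encode_json({ response => $chars });

# Show each byte of a string as two hex digits
sub hexdump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

for my $json ($from_bytes, $from_chars) {
    my ($value) = $json =~ /"response":"(.*)"/s;
    print hexdump($value), "\n";
}
# 63 61 66 C3 83 C2 A9   (each original byte encoded again)
# 63 61 66 C3 A9         (é encoded once)
```

If the first line matches what the browser receives, the extra bytes would account for the stray characters in the display.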

I've captured the query string before decoding and it is "userChat=caf%C3%A9"

It seems very strange to me that we start off with four characters in "café" and seem to get to five with "caf%C3%A9" which gets decoded as five characters...
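A minimal sketch of where the fifth item may come from, assuming the browser sends the text as UTF-8 (which encodeURIComponent always does): "é" is one character but two bytes in UTF-8, and each of those bytes gets its own %XX escape. The escaping regex here only approximates encodeURIComponent.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $word  = "caf\x{e9}";               # "café" as 4 characters; \x{e9} is é
my $bytes = encode('UTF-8', $word);    # é becomes the two bytes 0xC3 0xA9

my $char_count = length $word;
my $byte_count = length $bytes;

# Percent-encode every byte outside a basic unreserved set
(my $escaped = $bytes) =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord $1)/ge;

print "characters: $char_count\n";     # characters: 4
print "bytes:      $byte_count\n";     # bytes:      5
print "escaped:    $escaped\n";        # escaped:    caf%C3%A9
```

So the query string carries five bytes, and whether they become four characters again depends on whether the receiving side decodes them as UTF-8.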

The code that does the decoding in incl::HTML looks like this. I cannot recall where it came from, but it has been working for many, many years and has definitely handled Turkish characters in the past under Perl v5.16.3. I wonder if it is failing after the change to Perl v5.36.0.

my @pairs = split /&/, $query_string;
foreach my $p(@pairs) {
    $p =~ tr/+/ /;
    $p =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    my ($key, $val) = split /=/, $p, 2;
    $data{$key} = $val;
}
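To see what this loop leaves in %data, here is a self-contained sketch that wraps the same logic in a sub. The name parse_query and its second argument are hypothetical and not part of incl::HTML. The optional decode step is one possible way to turn the bytes back into characters, assuming the query string is always percent-encoded UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: same loop as above, with an optional UTF-8 decode
sub parse_query {
    my ($query_string, $decode_utf8) = @_;
    my %data;
    for my $p (split /&/, $query_string) {
        $p =~ tr/+/ /;
        # Each %XX escape becomes one byte
        $p =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        my ($key, $val) = split /=/, $p, 2;
        # Assumption: the bytes are UTF-8, so decode them into characters
        $val = decode('UTF-8', $val) if $decode_utf8 && defined $val;
        $data{$key} = $val;
    }
    return %data;
}

my %raw     = parse_query('userChat=caf%C3%A9', 0);
my %decoded = parse_query('userChat=caf%C3%A9', 1);

my $raw_len     = length $raw{userChat};
my $decoded_len = length $decoded{userChat};
print "without decode: $raw_len\n";    # without decode: 5
print "with decode:    $decoded_len\n"; # with decode:    4
```

Without the decode step the value is five separate bytes, which matches the five "characters" seen after decoding the query string.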

I am beginning to think that I will never understand this mysterious world of character encodings... then I remember that for many, many years references, especially hashrefs, were a total mystery to me, and now I use them without having to think too hard about it. This is in no small part thanks to the Monastery, and I'm hoping a similar magical revelation might be bestowed on me for character encoding! Everything was so much easier when all we had was ASCII!


In reply to Yet another Encoding issue... by Bod
