in reply to Yet another Encoding issue...

I've solved the two issues...I'm putting the solutions here for the benefit of anyone who comes this way again with a similar issue.

Decoding the user input has been solved with Encode as pointed out by Danny in Re^3: Yet another Encoding issue...

The apparent issue with the output from AI::Chat was that it was being fed the wrong encoding. Using the decode method from Encode at the point the chat history is pulled out of the database helped. But the problem reappeared as the chat went on. So I looked more closely at the MariaDB database encoding.
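For anyone following along, the decode step looks something like this. This is a minimal sketch with made-up data standing in for a DBI row: unless the driver is told otherwise (e.g. via mysql_enable_utf8mb4), it hands back raw bytes, which need an explicit Encode::decode before Perl treats them as characters.

```perl
use strict;
use warnings;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';

# Simulated "row" bytes as MariaDB would return them
# (UTF-8 encoding of the Turkish greeting "Günaydın"):
my $raw = "G\xC3\xBCnayd\xC4\xB1n";

# Decode at the point the history is pulled out of the database:
my $text = decode('UTF-8', $raw);    # now a Perl character string

printf "length in bytes: %d, in characters: %d\n",
    length($raw), length($text);     # 10 bytes, 8 characters
```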

The table that stores the chat history was encoded as utf8. I changed it to utf8mb4 and suddenly all the encoding issues seem to have gone away 😊
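For reference, the conversion was along these lines (table name hypothetical; note that CONVERT TO rewrites the existing column data as well, unlike a bare ALTER TABLE ... CHARACTER SET, which only changes the default for new columns):

```sql
-- Hypothetical table name; adjust to your schema.
ALTER TABLE chat_history
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;
```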

Re: Solved... (was: Re: Yet another Encoding issue...)
by Danny (Chaplain) on Jun 02, 2024 at 13:59 UTC
    It looks like MySQL's utf8 (an alias for utf8mb3) uses up to 3 bytes per character, while utf8mb4 uses up to 4. It might be an interesting exercise to figure out which characters were not fitting into 3 bytes. utf8mb3 can only store code points from 0 to 65535 (the Basic Multilingual Plane), so you could look for ord($char) > 65535.
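That check is a one-liner with grep. A sketch, using a made-up reply string; worth noting that every Turkish letter listed in this thread sits below U+FFFF, so on this theory the 4-byte offenders would be characters like emoji:

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Characters above U+FFFF need 4 bytes in UTF-8, so they won't fit
# in MySQL's 3-byte utf8/utf8mb3. $reply is a made-up chat reply.
my $reply = "Merhaba \x{1F600} d\x{FC}nya";   # contains U+1F600 GRINNING FACE

my @wide = grep { ord($_) > 65535 } split //, $reply;

printf "%d character(s) need utf8mb4: %s\n",
    scalar @wide,
    join ' ', map { sprintf 'U+%X', ord $_ } @wide;
# prints: 1 character(s) need utf8mb4: U+1F600
```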

      The standard Turkish characters that are not in the English alphabet are Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü

      It's strangely interesting that the AI generates more spurious characters the more incorrectly encoded characters are fed to it. I wonder if it tries to guess the encoding and gets confused.

      When I click 'preview' here in PM, the text is partly converted to HTML entities - that's probably the characters that were causing the issue.

      Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü

        Very interesting reading your nodes Bod! I've been working on something similar (multilingual/unicode+chatgpt) and haven't quite got to the bottom of the encoding yet... I'm trying to get things round-trip safe, but it's tricky because the API sometimes returns unicode and sometimes iso-8859-1. I've looked through the headers and I can't see any indication that the API knows which it is sending. I think I'm going to have to check the response each time and see what encoding it seems to have used.
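        A minimal sketch of that per-response check, assuming the only two candidates are UTF-8 and iso-8859-1: try a strict UTF-8 decode first, and fall back to iso-8859-1 (which never fails) only if that croaks. The caveat is that some iso-8859-1 byte sequences also happen to be valid UTF-8, so the guess can occasionally be wrong.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Strict UTF-8 decode first; iso-8859-1 as the fallback.
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text
        ? ($text, 'UTF-8')
        : (decode('iso-8859-1', $bytes), 'iso-8859-1');
}

my ($t1, $e1) = decode_guess("k\xC3\xBCnstlich");  # valid UTF-8 bytes
my ($t2, $e2) = decode_guess("k\xFCnstlich");      # latin-1 bytes, invalid UTF-8
print "$e1 / $e2\n";   # prints: UTF-8 / iso-8859-1
```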

        I think etj must be right - it's been trained on unlabelled mixed encodings, so it can't distinguish. For example, I recently asked it about a couple of new emojis in the latest unicode release, and it completely mis-identified them.

        Here's the script I'm using to test:

        #!/usr/bin/env perl
        use Modern::Perl;
        use 5.028;
        use LWP::UserAgent qw<>;
        use JSON qw<>;
        use Encode qw<>;

        # demonstrate a problem with sending/receiving utf8 to openai

        binmode STDOUT, ":encoding(UTF-8)";
        binmode STDIN,  ":encoding(UTF-8)";
        binmode STDERR, ":encoding(UTF-8)";

        my @tests = (
            # ['plain ascii' => 'word' ], # boring
            # ['unencoded utf8' => "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz"], # times out
            ['encoded utf8' => Encode::encode('UTF-8', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")],
            # ['encoded iso-8859-1' => Encode::encode('iso-8859-1', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")], # timeout
            # ['unencoded utf8 emoji' => "\N{ROBOT FACE}"], # causes FATAL wide-character or "HTTP::Message content must be bytes"
            ['encoded utf8 emoji' => Encode::encode('UTF-8', "\N{ROBOT FACE}")],
        );

        main: {
            my $ua = LWP::UserAgent->new;
            $ua->timeout(20); # Set low at the risk of false negatives.
            for my $test (@tests) {
                my $string = $test->[1];
                say "---" x 30 . "\nTest: $test->[0]";
                say "Input string raw: $string";
                say "Input string decoded: " . Encode::decode('UTF-8', $string);
                my $response = &chatgpt_echo($ua, $string) or next;
                say "Response raw: $response";
                say "Response decoded (UTF-8): "
                    . (eval { Encode::decode('UTF-8', $response) } // "DECODE ERROR: $@");
                say "Response decoded (iso-8859-1): "
                    . (eval { Encode::decode('iso-8859-1', $response) } // "DECODE ERROR: $@");
            }
        }

        sub chatgpt_echo {
            my ($ua, $string) = @_;
            my $model    = $ENV{OPENAI_COMPLETION_MODEL} // 'gpt-4o';
            my $endpoint = 'https://api.openai.com/v1/chat/completions';
            my $token    = $ENV{OPENAI_API_KEY}
                or die "set the environment variable OPENAI_API_KEY\neg:\nOPENAI_API_KEY=asdfasdfasdf perl $0";
            my $system_prompt = 'This is a test. Echo the exact input in your response.
        No other output in the response.';
            my $json_payload = JSON::to_json({
                model    => $model,
                messages => [
                    { role => 'system', content => $system_prompt },
                    { role => 'user',   content => $string },
                ]
            });
            my $request = HTTP::Request->new(POST => $endpoint);
            $request->header("Authorization" => "Bearer $token");
            $request->header("Content-Type"  => "application/json");
            eval { $request->content($json_payload) };
            if ($@) {
                print "Error setting content: $@";
                return;
            }
            my $response = $ua->request($request);
            if ($response->is_success) {
                my $decoded = JSON::decode_json($response->decoded_content);
                return $decoded->{choices}[0]{message}{content};
            }
            say "Response ERROR: " . $response->message;
            return;
        }


        - Boldra
        So-called "AI" has to tokenise its input, turning a stream of characters into integers. Tokenisation isn't restricted to breaking on word boundaries; it can split by letter, or indeed by byte.

        I would expect the problem here is that ChatGPT's tokenisation, being made by Americans, isn't careful to avoid splitting across multi-byte UTF-8 characters. That will make any training based on such tokens susceptible to errors.
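        That failure mode is easy to demonstrate directly with Encode (a sketch): split the two UTF-8 bytes of a single character, and neither half decodes back to anything meaningful.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_DEFAULT);
binmode STDOUT, ':encoding(UTF-8)';

# "ü" encodes to the two bytes 0xC3 0xBC. If a byte-level tokeniser
# splits between them, each half is invalid UTF-8 on its own and
# decodes (with FB_DEFAULT) to U+FFFD REPLACEMENT CHARACTER.
my $bytes  = encode('UTF-8', "\x{FC}");
my $first  = substr($bytes, 0, 1);   # 0xC3
my $second = substr($bytes, 1, 1);   # 0xBC

my $garbled = decode('UTF-8', $first,  FB_DEFAULT)
            . decode('UTF-8', $second, FB_DEFAULT);
print "$garbled\n";   # two replacement characters, not "ü"
```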