in reply to Solved... (was: Re: Yet another Encoding issue...)
in thread Yet another Encoding issue...

It looks like MySQL's utf8 (an alias for utf8mb3) uses up to 3 bytes per character, while utf8mb4 uses up to 4. It might be an interesting exercise to figure out which characters were not fitting into 3 bytes. utf8mb3 only covers code points 0 to 65535 (the Basic Multilingual Plane), so you could look for ord($char) > 65535.
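A minimal sketch of that check (the sample string here is made up, and it assumes you already have a decoded Perl character string rather than raw bytes):

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
binmode STDOUT, ':encoding(UTF-8)';

# utf8mb3 covers only the BMP (U+0000..U+FFFF); anything above that
# needs 4 bytes in UTF-8, i.e. MySQL's utf8mb4.
my $text = "caf\x{E9} \x{1F600}";    # U+00E9 fits in utf8mb3; U+1F600 does not
for my $char (split //, $text) {
    say sprintf 'U+%04X needs utf8mb4', ord $char if ord($char) > 65535;
}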

Re^2: Solved... (was: Re: Yet another Encoding issue...)
by Bod (Parson) on Jun 02, 2024 at 21:45 UTC

    The standard Turkish characters that are not in the English alphabet are Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü

    It's strangely interesting that the AI generates more spurious characters the more incorrectly encoded characters are fed to it. I wonder if it tries to guess the encoding and gets confused.

    When I click 'preview' here in PM, the text is partly converted to HTML entities - those are probably the characters that were causing the issue.

    Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü

      Very interesting reading your nodes, Bod! I've been working on something similar (multilingual/Unicode + ChatGPT) and haven't quite got to the bottom of the encoding yet... I'm trying to make things round-trip safe, but it's tricky because the API sometimes returns UTF-8 and sometimes iso-8859-1. I've looked through the headers and I can't see any indication that the API knows which encoding it is sending, so I think I'm going to have to check each response and work out which encoding it appears to have used - a sketch of that check follows below.
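      Something like this is the fallback heuristic I have in mind - just a sketch, not anything the API documents:

      use strict;
      use warnings;
      use Encode qw<>;

      # Try a strict UTF-8 decode first; if the bytes aren't valid UTF-8,
      # fall back to iso-8859-1, where every byte sequence decodes.
      # Caveat: iso-8859-1 text that happens to also be valid UTF-8
      # will be misread as UTF-8.
      sub guess_decode {
          my ($bytes) = @_;
          my $text = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) };
          return defined $text ? $text : Encode::decode('iso-8859-1', $bytes);
      }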

      I think etj must be right - it's been trained on unlabelled mixed encodings, so it can't tell them apart. For example, I recently asked it about a couple of new emoji in the latest Unicode release, and it completely misidentified them.

      Here's the script I'm using to test:

      #!/usr/bin/env perl
      use Modern::Perl;
      use 5.028;
      use LWP::UserAgent qw<>;
      use HTTP::Request qw<>;    # used directly below
      use JSON qw<>;
      use Encode qw<>;

      # demonstrate a problem with sending/receiving utf8 to openai

      binmode STDOUT, ":encoding(UTF-8)";
      binmode STDIN,  ":encoding(UTF-8)";
      binmode STDERR, ":encoding(UTF-8)";

      my @tests = (
          # ['plain ascii' => 'word'], # boring
          # ['unencoded utf8' => "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz"], # times out
          ['encoded utf8' => Encode::encode('UTF-8', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")],
          # ['encoded iso-8859-1' => Encode::encode('iso-8859-1', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")], # timeout
          # ['unencoded utf8 emoji' => "\N{ROBOT FACE}"], # causes FATAL wide-character or "HTTP::Message content must be bytes"
          ['encoded utf8 emoji' => Encode::encode('UTF-8', "\N{ROBOT FACE}")],
      );

      main: {
          my $ua = LWP::UserAgent->new;
          $ua->timeout(20); # Set low at the risk of false negatives.
          for my $test (@tests) {
              my $string = $test->[1];
              say "---" x 30 . "\nTest: $test->[0]";
              say "Input string raw: $string";
              say "Input string decoded: " . Encode::decode('UTF-8', $string);
              my $response = chatgpt_echo($ua, $string) or next;
              say "Response raw: $response";
              say "Response decoded (UTF-8): "
                  . (eval { Encode::decode('UTF-8', $response) } // "DECODE ERROR: $@");
              say "Response decoded (iso-8859-1): "
                  . (eval { Encode::decode('iso-8859-1', $response) } // "DECODE ERROR: $@");
          }
      }

      sub chatgpt_echo {
          my ($ua, $string) = @_;
          my $model    = $ENV{OPENAI_COMPLETION_MODEL} // 'gpt-4o';
          my $endpoint = 'https://api.openai.com/v1/chat/completions';
          my $token    = $ENV{OPENAI_API_KEY}
              or die "set the environment variable OPENAI_API_KEY\neg:\nOPENAI_API_KEY=asdfasdfasdf perl $0";
          my $system_prompt = 'This is a test. Echo the exact input in your response. No other output in the response.';
          my $json_payload = JSON::to_json({
              model    => $model,
              messages => [
                  { role => 'system', content => $system_prompt },
                  { role => 'user',   content => $string },
              ],
          });
          my $request = HTTP::Request->new(POST => $endpoint);
          $request->header("Authorization" => "Bearer $token");
          $request->header("Content-Type"  => "application/json");
          eval { $request->content($json_payload) };
          if ($@) {
              print "Error setting content: $@";
              return;
          }
          my $response = $ua->request($request);
          if ($response->is_success) {
              my $decoded = JSON::decode_json($response->decoded_content);
              return $decoded->{choices}[0]{message}{content};
          }
          say "Response ERROR: " . $response->message;
          return;
      }


      - Boldra
      So-called "AI" has to tokenise its input, turning a stream of characters into integers. Tokenisation isn't restricted to breaking on word boundaries; it can split by letter, or indeed by byte.

      I would expect the problem here is that ChatGPT's tokenisation, being made by Americans, isn't careful to avoid splitting across multi-byte UTF-8 characters. That will make any training based on such tokens susceptible to errors.
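      A quick illustration of the splitting problem (this is just byte-slicing by hand, not ChatGPT's actual tokeniser):

      use strict;
      use warnings;
      use feature 'say';
      use charnames ':full';
      use Encode qw<>;

      # "ü" is two bytes in UTF-8: 0xC3 0xBC. A byte-level tokeniser may
      # cut between them, leaving fragments that no longer decode to the
      # original character.
      my $bytes = Encode::encode('UTF-8', "\N{LATIN SMALL LETTER U WITH DIAERESIS}");
      say sprintf 'fragment: 0x%02X', ord $_ for split //, $bytes;    # 0xC3, 0xBC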