It looks like MySQL utf8 (an alias for utf8mb3) uses up to 3 bytes per character, while utf8mb4 uses up to 4 bytes. It might be an interesting exercise to figure out which characters didn't fit into 3 bytes. utf8mb3 only covers code points 0 to 65535 (the Basic Multilingual Plane), so I guess you could look for ord($char) > 65535.
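For example, something like this minimal sketch (the sample string is just made up) would flag the characters that need utf8mb4:

#!/usr/bin/env perl
# Sketch: report characters above U+FFFF, i.e. the ones that need
# 4 bytes in UTF-8 and therefore utf8mb4 rather than utf8mb3.
use strict;
use warnings;

my $text = "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstlich \N{ROBOT FACE}";   # hypothetical sample input
for my $char (split //, $text) {
    printf "U+%04X needs utf8mb4\n", ord($char) if ord($char) > 65535;
}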
The standard Turkish characters that are not in the English alphabet are Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü
It's strangely interesting that the AI generates more spurious characters the more incorrectly encoded characters are fed to it. I wonder if it tries to guess the encoding and gets confused.
When I click 'preview' here in PM, the text is partly converted to HTML entities - those are probably the characters that were causing the issue.
Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü
Very interesting reading your nodes Bod! I've been working on something similar (multilingual/unicode+chatgpt) and haven't quite got to the bottom of the encoding yet...
I'm trying to get things round-trip safe, but it's tricky because the API sometimes returns unicode and sometimes iso-8859-1. I've looked through the headers and I can't see any indication that the API knows which it is sending. I think I'm going to have to check the response each time and see what encoding it seems to have used.
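Something along these lines might do it (guess_decode is a hypothetical helper sketched here, not part of the script below): try a strict UTF-8 decode first and only fall back to iso-8859-1 if the bytes aren't valid UTF-8.

use Encode ();

# Sketch of the per-response check: strict UTF-8 first, Latin-1 fallback.
sub guess_decode {
    my ($bytes) = @_;
    my $copy = $bytes;    # FB_CROAK can modify its argument, so decode a copy
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : Encode::decode('iso-8859-1', $bytes);
}

The usual caveat applies: any byte string is valid iso-8859-1, so this only works because UTF-8 is tried first.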
I think etj must be right - it's been trained on unlabelled mixed encodings, so it can't distinguish. For example, I recently asked it about a couple of new emojis in the latest unicode release, and it completely mis-identified them.
Here's the script I'm using to test:
#!/usr/bin/env perl
use Modern::Perl;
use 5.028;
use LWP::UserAgent qw<>;
use HTTP::Request qw<>;
use JSON qw<>;
use Encode qw<>;

# demonstrate a problem with sending/receiving utf8 to openai
binmode STDOUT, ":encoding(UTF-8)";
binmode STDIN,  ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";

my @tests = (
    # ['plain ascii' => 'word'], # boring
    # ['unencoded utf8' => "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz"], # times out
    ['encoded utf8' => Encode::encode('UTF-8', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")],
    # ['encoded iso-8859-1' => Encode::encode('iso-8859-1', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")], # timeout
    # ['unencoded utf8 emoji' => "\N{ROBOT FACE}"], # causes FATAL wide-character or "HTTP::Message content must be bytes"
    ['encoded utf8 emoji' => Encode::encode('UTF-8', "\N{ROBOT FACE}")],
);

main: {
    my $ua = LWP::UserAgent->new;
    $ua->timeout(20);    # Set low at the risk of false negatives.
    for my $test (@tests) {
        my $string = $test->[1];
        say "---" x 30 . "\nTest: $test->[0]";
        say "Input string raw: $string";
        say "Input string decoded: " . Encode::decode('UTF-8', $string);
        my $response = chatgpt_echo($ua, $string) or next;
        say "Response raw: $response";
        say "Response decoded (UTF-8): "
            . (eval { Encode::decode('UTF-8', $response) } // "DECODE ERROR: $@");
        say "Response decoded (iso-8859-1): "
            . (eval { Encode::decode('iso-8859-1', $response) } // "DECODE ERROR: $@");
    }
}

# Send $string to the chat completions endpoint and return the model's
# reply, or return nothing on error.
sub chatgpt_echo {
    my ($ua, $string) = @_;
    my $model    = $ENV{OPENAI_COMPLETION_MODEL} // 'gpt-4o';
    my $endpoint = 'https://api.openai.com/v1/chat/completions';
    my $token    = $ENV{OPENAI_API_KEY}
        or die "set the environment variable OPENAI_API_KEY\neg:\nOPENAI_API_KEY=asdfasdfasdf perl $0";
    my $system_prompt = 'This is a test. Echo the exact input in your response. No other output in the response.';
    my $json_payload  = JSON::to_json({
        model    => $model,
        messages => [
            { role => 'system', content => $system_prompt },
            { role => 'user',   content => $string },
        ],
    });
    my $request = HTTP::Request->new(POST => $endpoint);
    $request->header("Authorization" => "Bearer $token");
    $request->header("Content-Type"  => "application/json");
    eval { $request->content($json_payload) };    # dies if given wide characters
    if ($@) {
        print "Error setting content: $@";
        return;
    }
    my $response = $ua->request($request);
    if ($response->is_success) {
        my $decoded = JSON::decode_json($response->decoded_content);
        return $decoded->{choices}[0]{message}{content};
    }
    say "Response ERROR: " . $response->message;
    return;
}
So-called "AI" has to tokenise its input from a stream of characters, into integers. This isn't restricted to breaking by word, but can do by letter, or indeed byte.
I would expect the problem here is because the ChatGPT's tokenisation, being made by Americans, isn't careful to avoid splitting across UTF-8 characters. That will make any training based on such tokens susceptible to errors.
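A minimal sketch of that failure mode (not code from any of the posts above): splitting the UTF-8 bytes of a single character leaves fragments that are no longer valid UTF-8 on their own.

#!/usr/bin/env perl
# Sketch: a byte-level split of one UTF-8 character yields invalid fragments.
use strict;
use warnings;
use Encode ();

# "ü" is two bytes in UTF-8: 0xC3 0xBC
my $bytes = Encode::encode('UTF-8', "\N{LATIN SMALL LETTER U WITH DIAERESIS}");
for my $tok (split //, $bytes) {          # naive byte-level "tokens"
    my $copy = $tok;                      # FB_CROAK can modify its argument
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    printf "byte 0x%02X is valid UTF-8 on its own? %s\n", ord($tok), $ok ? "yes" : "no";
}

Both bytes fail to decode on their own, which is the kind of fragment a byte-level tokeniser can hand to the model.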