in reply to Re^2: Solved... (was: Re: Yet another Encoding issue...)
in thread Yet another Encoding issue...
Very interesting reading your nodes, Bod! I've been working on something similar (multilingual/Unicode + ChatGPT) and haven't quite got to the bottom of the encoding yet. I'm trying to get things round-trip safe, but it's tricky because the API sometimes seems to return UTF-8 and sometimes ISO-8859-1. I've looked through the response headers and can't see any indication that the API knows which it is sending, so I think I'm going to have to check each response and see which encoding it appears to have used.
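A minimal sketch of the kind of per-response check I mean, assuming the response body is available as raw bytes. `guess_decode` is a hypothetical helper name of my own, not part of Encode or of the API. It tries a strict UTF-8 decode first and only falls back to ISO-8859-1 if that fails:

```perl
use strict;
use warnings;
use Encode qw<>;

# Hypothetical helper: guess whether raw bytes are UTF-8 or ISO-8859-1.
# Strict UTF-8 is tried first; ISO-8859-1 is only the fallback, because
# every byte sequence is valid ISO-8859-1 and so it can never "fail".
sub guess_decode {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ('UTF-8', $text) if defined $text;
    return ('iso-8859-1', Encode::decode('iso-8859-1', $bytes));
}

my $word = "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz";
for my $encoding ('UTF-8', 'iso-8859-1') {
    my ($guess, $text) = guess_decode(Encode::encode($encoding, $word));
    printf "%-10s guessed as %-10s round-trip %s\n",
        $encoding, $guess, ($text eq $word ? 'ok' : 'FAILED');
}
```

This is only a heuristic: ISO-8859-1 bytes that happen to form a valid UTF-8 sequence would be guessed wrongly, and plain ASCII is reported as UTF-8 (which is harmless).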
I think etj must be right: the model has been trained on unlabelled mixed encodings, so it can't distinguish between them. For example, I recently asked it about a couple of new emojis in the latest Unicode release, and it completely misidentified them.
Here's the script I'm using to test:
    #!/usr/bin/env perl
    use Modern::Perl;
    use 5.028;
    use LWP::UserAgent qw<>;
    use HTTP::Request qw<>;
    use JSON qw<>;
    use Encode qw<>;

    # demonstrate a problem with sending/receiving utf8 to openai

    binmode STDOUT, ":encoding(UTF-8)";
    binmode STDIN,  ":encoding(UTF-8)";
    binmode STDERR, ":encoding(UTF-8)";

    # Each test is [ label => string to send ].
    my @tests = (
        # ['plain ascii' => 'word' ], #boring
        # ['unencoded utf8' => "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz"], # times out
        ['encoded utf8' => Encode::encode('UTF-8', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")],
        # ['encoded iso-8859-1' => Encode::encode('iso-8859-1', "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche Intelligenz")], #timeout
        # ['unencoded utf8 emoji' => "\N{ROBOT FACE}"], # causes FATAL wide-character or "HTTP::Message content must be bytes"
        ['encoded utf8 emoji' => Encode::encode('UTF-8', "\N{ROBOT FACE}")],
    );

    main: {
        my $ua = LWP::UserAgent->new;
        $ua->timeout(20); # Set low at the risk of false negatives.

        for my $test (@tests) {
            my $string = $test->[1];

            say "---" x 30 . "\nTest: $test->[0]";
            say "Input string raw: $string";
            say "Input string decoded: " . Encode::decode('UTF-8', $string);

            my $response = chatgpt_echo($ua, $string) or next;

            # Show the response as returned and under both candidate encodings.
            say "Response raw: $response";
            say "Response decoded (UTF-8): "
                . (eval { Encode::decode('UTF-8', $response) } // "DECODE ERROR: $@");
            say "Response decoded (iso-8859-1): "
                . (eval { Encode::decode('iso-8859-1', $response) } // "DECODE ERROR: $@");
        }
    }

    # Ask the model to echo $string back; returns the reply text, or nothing on error.
    sub chatgpt_echo {
        my ($ua, $string) = @_;

        my $model    = $ENV{OPENAI_COMPLETION_MODEL} // 'gpt-4o';
        my $endpoint = 'https://api.openai.com/v1/chat/completions';
        my $token    = $ENV{OPENAI_API_KEY}
            or die "set the environment variable OPENAI_API_KEY\neg:\nOPENAI_API_KEY=asdfasdfasdf perl $0";

        my $system_prompt = 'This is a test. Echo the exact input in your response. No other output in the response.';

        my $json_payload = JSON::to_json({
            model    => $model,
            messages => [
                { role => 'system', content => $system_prompt },
                { role => 'user',   content => $string },
            ]
        });

        my $request = HTTP::Request->new(POST => $endpoint);
        $request->header("Authorization" => "Bearer $token");
        $request->header("Content-Type"  => "application/json");

        # content() dies if the payload contains wide characters.
        eval { $request->content($json_payload) };
        if ($@) {
            print "Error setting content: $@";
            return;
        }

        my $response = $ua->request($request);
        if ( $response->is_success ) {
            my $decoded = JSON::decode_json($response->decoded_content);
            return $decoded->{choices}[0]{message}{content};
        }

        say "Response ERROR: " . $response->message;
        return;
    }
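For comparison, here is an offline sketch of just the JSON layer, with no network involved. It assumes the core module JSON::PP, whose `encode_json`/`decode_json` I understand to behave like the JSON module's functions of the same name: `encode_json` takes character strings and returns UTF-8 bytes, and `decode_json` takes UTF-8 bytes and returns character strings. It is not a diagnosis of the script above, only a way to see what each layer does to a string that is or isn't already encoded:

```perl
use strict;
use warnings;
use JSON::PP qw<>;
use Encode qw<>;

# Offline sketch: only the JSON encode/decode layer, no HTTP request.
my $text = "k\N{LATIN SMALL LETTER U WITH DIAERESIS}nstliche \N{ROBOT FACE}";

# encode_json: character string in, UTF-8 encoded bytes out.
my $payload = JSON::PP::encode_json({ content => $text });

# decode_json: UTF-8 bytes in, character strings out.
my $back = JSON::PP::decode_json($payload)->{content};
print "characters in, characters out: ",
    ($back eq $text ? "ok" : "FAILED"), "\n";

# Encoding the string *before* it goes into the structure encodes it twice:
# each UTF-8 byte is treated as a character and encoded again.
my $double  = JSON::PP::encode_json({ content => Encode::encode('UTF-8', $text) });
my $mangled = JSON::PP::decode_json($double)->{content};
print "pre-encoded input: ",
    ($mangled eq $text ? "ok" : "comes back as UTF-8 bytes, not characters"), "\n";
```

If the same pattern holds over HTTP, it may be worth comparing what `$response->content` (raw bytes) and `$response->decoded_content` hand to `decode_json`, since `decode_json` expects bytes.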