if I try to insert it into a latin1 mySQL database
Any mysql from at least the last decade will support UTF8 mode. Assuming you have DBI there, the connection options should be at least:
connect($dsn, $user, $pass, {
    RaiseError        => 1,
    AutoCommit        => 1,
    mysql_enable_utf8 => 1,
    on_connect_call   => 'set_strict_mode',
})
If this isn't an option and you really need mysql to be latin1, then you need to convert the incoming data to latin1 before you insert it.
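use Encode qw(encode);
# Encode::FB_CROAK is the named form of the check value 1: die on unmappable characters.
$_ = encode('iso-8859-1', $_, Encode::FB_CROAK) for values %$add;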
Note that this dies if there is any character in the string that is outside latin1.
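As a rough sketch of that failure mode, the snippet below wraps the same in-place conversion in an eval so the error can be inspected instead of killing the script. The hashes %fits and %wide and the helper to_latin1 are made-up names for illustration; they are not from your script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample rows; names and contents are illustrative only.
my %fits = (name => "M\x{E4}rz");         # U+00E4 exists in latin1
my %wide = (name => "price: \x{20AC}5");  # U+20AC (euro sign) does not

# Encode every value of the hash in place.
# Returns '' on success, or the error text if encode() croaked.
sub to_latin1 {
    my ($row) = @_;
    my $done = eval {
        $_ = encode('iso-8859-1', $_, Encode::FB_CROAK) for values %$row;
        1;
    };
    return $done ? '' : $@;
}

my $err_fits = to_latin1(\%fits);
my $err_wide = to_latin1(\%wide);

print "fits row: ", ($err_fits ? "failed: $err_fits" : "encoded\n");
print "wide row: ", ($err_wide ? "failed: $err_wide" : "encoded\n");
```

Whether you want to die, skip the row, or substitute a placeholder character is a policy decision; the eval only gives you the place to make it.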
Any time you mess with encodings, you need to distinguish which Perl scalars are holding "unicode strings" and which are holding "byte sequences". It would be nice if Perl tracked this for you, but it does not. (Perl internally tracks whether it has wide characters in a scalar, but that is not the same as tracking the logical intent of the string.) Setting an encoding on a filehandle means you are receiving "unicode strings" from it. Before you pass those strings to an API that isn't unicode-aware, you need to choose what bytes they should become. You can convert manually to the bytes of your choice using the Encode module, or upgrade your DBI connection to be unicode-aware.
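To make the distinction concrete, here is a minimal sketch using only the core Encode module. The sample string is invented; the point is that the same text has different lengths as bytes and as characters, and that Perl's internal flag says nothing about which one you meant.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 file read without a decoding layer.
my $bytes = "M\xC3\xA4rz";                  # 5 bytes: the a-umlaut is the pair C3 A4

# Decode once at the boundary: now a 4-character "unicode string".
my $chars = decode('UTF-8', $bytes);

# Choose the output bytes explicitly when the data leaves Perl.
my $latin1 = encode('iso-8859-1', $chars);  # 4 bytes, a-umlaut is the single byte E4
my $utf8   = encode('UTF-8', $chars);       # 5 bytes again

printf "bytes in: %d, characters: %d, latin1 out: %d, utf8 out: %d\n",
    length($bytes), length($chars), length($latin1), length($utf8);

# The internal flag describes storage, not intent: after utf8::upgrade the
# copy is stored differently but still compares equal to the original.
my $flagged = $latin1;
utf8::upgrade($flagged);
printf "same string: %s, flag differs: %s\n",
    ($flagged eq $latin1 ? "yes" : "no"),
    (utf8::is_utf8($flagged) != utf8::is_utf8($latin1) ? "yes" : "no");
```

Since nothing in the scalar records whether it is "decoded text" or "encoded bytes", the only reliable bookkeeping is your own: decode at the input boundary, encode at the output boundary.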
Edit: re-reading your initial post, I realize this could use more clarification. You were previously loading the UTF-8 file and converting it to latin1, then comparing it with a regex of bytes which were latin1 (because your source file was ASCII with unofficial default latin1 upper bytes), which worked, then sending it to the database. On the new system the DB complained about \xE4 even though nothing in your script changed, right? I don't think Perl changed behavior, because it is highly backward compatible, but it might be that the database driver has new defaults and in fact already expects utf8. A lone \xE4 is not a valid utf8 sequence.
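If the driver really is expecting utf8 now (that part is only my guess), the complaint about \xE4 is easy to reproduce outside the database. The helper name is_valid_utf8 below is made up for this sketch; it simply asks Encode to decode strictly and reports whether that succeeded.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_bytes = "\xE4";       # a-umlaut encoded as latin1
my $utf8_bytes   = "\xC3\xA4";   # a-umlaut encoded as UTF-8

# Returns 1 if the byte string decodes cleanly as UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops it from modifying $bytes while it works.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

printf "latin1 byte E4 valid as UTF-8:  %d\n", is_valid_utf8($latin1_bytes);
printf "UTF-8 bytes C3 A4 valid:        %d\n", is_valid_utf8($utf8_bytes);
```

The byte E4 is the lead byte of a three-byte UTF-8 sequence, so on its own it is incomplete, which is consistent with a utf8-expecting consumer rejecting latin1 data.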
Regardless of what actually changed and broke things, the advice we are giving you here is to do all your processing as unicode. Use unicode for the script itself (which makes the regex unicode too), decode all incoming data before processing it, then encode all data before emitting it. Ideally, configure the mysql driver to do that encoding for you so you don't have to worry about it.
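Put together, the decode / process / encode flow could look roughly like the sketch below. The input line and the character class are invented for the example, and it assumes the script file itself is saved as UTF-8 so that `use utf8` applies to the regex literal. If the driver does the encoding for you, step 3 disappears.

```perl
use strict;
use warnings;
use utf8;                       # literals in this file (including the regex) are Unicode
use Encode qw(decode encode);

# Hypothetical input: raw UTF-8 bytes standing in for a line read from the file.
my $raw = "Gr\xC3\xBC\xC3\x9Fe aus K\xC3\xB6ln";

# 1. Decode on the way in (LEAVE_SRC keeps $raw untouched).
my $text = decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC);

# 2. Process as characters: the regex matches code points, not bytes.
my $has_umlaut = $text =~ /[äöüÄÖÜß]/ ? 1 : 0;

# 3. Encode on the way out, here to latin1 for a latin1 database.
my $out = encode('iso-8859-1', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "raw bytes: %d, characters: %d, latin1 bytes: %d, umlaut found: %d\n",
    length($raw), length($text), length($out), $has_umlaut;
```

Keeping every comparison in step 2 on decoded text means the script no longer depends on which byte encoding the source file or the input happened to use.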