
So then here's your next problem:

if I try to insert it into a latin1 mySQL database

Any mysql from at least the last decade will support UTF8 mode. Assuming you have DBI there, the connection options should be at least:

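connect($dsn, $user, $pass, {
    RaiseError        => 1,
    AutoCommit        => 1,
    mysql_enable_utf8 => 1,
    # on_connect_call is a DBIx::Class connect option, not a plain DBI attribute;
    # keep this line only if you connect through DBIx::Class.
    on_connect_call   => 'set_strict_mode',
})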
If this isn't an option and you really need mysql to be latin1, then you need to convert the incoming data to latin1 before you insert it.
use Encode qw(encode); $_ = encode('iso-8859-1', $_, 1) for values %$add;

Note that that dies if there is any character in the string that is outside latin1.
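
To illustrate that failure mode, here is a minimal sketch using only the core Encode module. The sample hash and its values are made up for illustration, and Encode::FB_CROAK() is simply the named constant behind the literal 1 used above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data: the first value fits in latin1, the second does not.
my $add = {
    ok  => "\x{E4}lter",         # U+00E4 (a with diaeresis) exists in latin1
    bad => "price: \x{20AC}5",   # U+20AC (euro sign) has no latin1 code point
};

my %result;
for my $key (sort keys %$add) {
    # Encode a copy: with a CHECK argument, encode() may modify its input.
    my $copy  = $add->{$key};
    my $bytes = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()) };
    $result{$key} = defined $bytes ? 'encoded' : 'died';
    print "$key: $result{$key}\n";
}
```

Wrapping the call in eval, as here, is one way to log or skip the offending row instead of aborting the whole import.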

Any time you mess with encodings, you need to distinguish which perl scalars hold "unicode strings" and which hold "byte sequences". It would be nice if Perl tracked this for you, but it does not. (Perl internally tracks whether a scalar holds wide characters, but that is not the same as tracking the logical intent of the string.) Setting an encoding on a filehandle means you are receiving "unicode strings" from it. Before you pass those strings to an API that isn't unicode-aware, you need to choose what bytes they should become. You can convert manually to the bytes of your choice using the Encode module, or you can upgrade your DBI connection to be unicode-aware.
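
A small sketch of that distinction, using only the core Encode module. The sample word is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 file read without an encoding layer.
my $utf8_bytes = "\xC3\xA4lter";                   # 6 bytes

# decode() turns the byte sequence into a "unicode string" of characters.
my $chars = decode('UTF-8', $utf8_bytes);          # 5 characters

# encode() chooses the bytes the string should become for a non-unicode API.
my $latin1_bytes = encode('iso-8859-1', $chars);   # 5 bytes

printf "utf8 bytes: %d, characters: %d, latin1 bytes: %d\n",
    length($utf8_bytes), length($chars), length($latin1_bytes);

# Perl compares both scalars character by character and considers them equal,
# so it cannot tell you which one was meant as text and which as bytes.
my $same = ($chars eq $latin1_bytes) ? 1 : 0;
print "chars eq latin1 bytes: $same\n";
```

The last comparison is why you have to keep track of the intent yourself: the decoded string and its latin1 encoding look identical to Perl.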

Edit: re-reading your initial post, I realize this could use more clarification. You were previously loading the UTF-8 file and converting it to latin1, then comparing it with a regex of bytes which were latin1 (because your source file was ASCII with unofficial default latin1 upper bytes), which worked, then sending it to the database. On the new system the DB complained about \xE4 even though nothing in your script changed, right? I don't think perl changed behavior, because it is highly backward compatible, but it might be that the database driver has new defaults and in fact already expects utf8, and a lone \xE4 is not a valid utf8 sequence.
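
You can check that last point with the core Encode module. This sketch only shows that the latin1 bytes fail a strict UTF-8 decode; it does not prove that this is what your driver is doing.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same word as latin1 bytes and as UTF-8 bytes.
my %samples = (
    latin1 => "\xE4lter",
    utf8   => "\xC3\xA4lter",
);

my %valid;
for my $name (sort keys %samples) {
    # Decode a copy: with a CHECK argument, decode() may modify its input.
    my $copy = $samples{$name};
    $valid{$name} = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
    print "$name bytes are valid UTF-8: $valid{$name}\n";
}
```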

Regardless of what actually changed and broke things, the advice we are giving you here is to do all your processing as unicode. Use unicode for the script itself, causing the regex to be unicode, decode all incoming data before processing it, then encode all data before emitting it. And, ideally configure the mysql driver to do that encoding for you so you don't have to worry about it.
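
A minimal sketch of that decode, process, encode pipeline, using only core modules. The input lines are made up, the script is assumed to be saved as UTF-8 (which is what "use utf8" declares), and the final encode targets latin1 only because that is your current database; with a unicode-aware driver you would drop that step.

```perl
use strict;
use warnings;
use utf8;                          # this source file is saved as UTF-8
use Encode qw(decode encode);

# Hypothetical input: raw UTF-8 bytes, as read without an encoding layer.
my @raw_lines = ("\xC3\xA4lter", "older");

my @out;
for my $raw (@raw_lines) {
    my $text = decode('UTF-8', $raw);         # 1. decode incoming bytes
    next unless $text =~ /ä/;                 # 2. match characters against characters
    push @out, encode('iso-8859-1', $text);   # 3. encode for a latin1 consumer
}

printf "%d line(s) matched\n", scalar @out;
```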

Re^3: Perl encoding problem
by derion (Sexton) on Dec 14, 2021 at 12:00 UTC

    Thank you very much for your additional comments.

    The scripts have not changed, but the Perl version and the locale of the server have. These are the variables I thought to be the reason for my problems.

    The migration to a mySQL UTF8 DB is one big target but there is a bunch of things I will have to modify to achieve this.
    The migration of the server is one first step.
    The difficulties I am having now help me to prepare the next steps - hopefully.

    I have a couple of files with different encodings that all end up in the database
    which is at the moment latin1_swedish_ci and will become some utf8 variant at some point.
    All migrated scripts and files worked fine so far, and the first problem occurred with the regex with the foreign characters in this script.

    I now added:
    foreach my $key (keys %$add) { $add->{$key} = encode('iso-8859-1', $add->{$key}, 1); }

    which is what I would have expected to be done originally.
    The outcome of the encoding is that all lines with foreign characters do not appear in the database anymore.
    I get "Incorrect string value: '\xE4lter'" for example.
    Without the encoding all lines are added in the database and all of them look alright.

    So as far as I understand you, my first assumption about the encoding of the strings in $add is wrong and I should not just insert them into a latin1_swedish_ci DB.
    On the other hand, encoding them produces errors during the import into a database with latin1_swedish_ci collation.
    So I am having another problem I don't understand yet.

      Can you find out whether that error is coming from the mysql server or from the DBI driver? If you set environment variable DBI_TRACE=1 it should clarify whether the query was sent to the server and rejected, or if it failed before sending.

      If it failed before sending, then what I think is most likely is that the DBI driver and/or mysql client library (which is presumably a new version as part of your new perl version) has gotten "smarter" and is trying to do the encode for you, and expects that you provide it a logically "unicode string" for it to encode. And if this is really the case, that is good news because you don't have to manually encode things, and (probably) will continue to work without further code changes when you switch to a utf8 database.