Perl encoding problem

derion has asked for the wisdom of the Perl Monks concerning the following question:

Re: Perl encoding problem by dave_the_m (Monsignor) on Dec 13, 2021 at 22:14 UTC
If you want to use literal utf8 characters in your source code (literal strings, regexes etc), then you need to tell the perl interpreter that the source code should be treated as utf8 by adding 'use utf8;' at the top of your script. Dave.
Re: Perl encoding problem by kcott (Archbishop) on Dec 13, 2021 at 22:59 UTC
G'day derion,

My first guess is that if `/behälter/i` is causing problems, but `/beh\xE4lter/i` is not, then using the utf8 pragma might be all you need.

Having said that, you've only provided code fragments. Parts that you've omitted may be important, e.g. how you call the open function. Please provide an SSCCE that we can run: you should keep this as short as possible while still showing the problem; also, please provide a short input file (probably only needs to be a few lines long).

The error you show, "Incorrect string value: '\xE4", contains an unexpected apostrophe: perhaps the actual error message, a typo, an SQL problem, or something else. Please paste verbatim program output within `<code>...</code>` tags, rather than typing by hand.

I generally prefer `"\x{NN}"` to `"\xNN"`, as it removes any possible ambiguity (especially if NN is followed by other digits). I don't see a problem with that here, but it could be elsewhere: a little defensive programming never hurts.

And just a heads-up, "`U+00E4 LATIN SMALL LETTER A WITH DIAERESIS`" (`ä`) canonically decomposes into `U+0061` (`a`) and `U+0308` (`¨`). Again, I don't see that as an issue here, but maybe worth knowing about. See PDF "Unicode Code Chart: 0080 - 00ff".

-- Ken
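Ken's heads-up about canonical decomposition can be illustrated with a minimal sketch. It assumes the core Unicode::Normalize module is available; the variable names are made up for illustration. It shows that a pattern written with the precomposed `\x{E4}` does not match the decomposed spelling until the string is normalized to NFC.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Precomposed form: a single U+00E4 character.
my $composed   = "beh\x{E4}lter";
# Decomposed form: "a" followed by U+0308 COMBINING DIAERESIS.
my $decomposed = NFD($composed);

printf "composed length:   %d\n", length $composed;     # 8
printf "decomposed length: %d\n", length $decomposed;   # 9

my $re = qr/beh\x{E4}lter/i;

# The decomposed string contains no U+00E4, so the pattern misses it.
print "decomposed matches directly: ",
    ($decomposed =~ $re ? "yes" : "no"), "\n";

# Normalizing to NFC first recombines the pair into U+00E4.
print "decomposed matches after NFC: ",
    (NFC($decomposed) =~ $re ? "yes" : "no"), "\n";
```

If input may arrive in decomposed form, normalizing it once right after decoding keeps later regexes simple.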
Re: Perl encoding problem by ikegami (Patriarch) on Dec 13, 2021 at 23:24 UTC
You need to add `use utf8;`. Unless you use `use utf8;`, Perl expects your code to be encoded using ASCII. Literals are 8-bit clean.[1] I'm guessing you didn't use `use utf8;`. If that's the case, you can't possibly have `/behälter/`, since `ä` isn't found in the ASCII character set. You're actually giving Perl something equivalent to `/beh\xC3\xA4lter/`. As you indicated, the proper solution is to give `/beh\xE4lter/` or equivalent. For `/behälter/` to be equivalent, you need to encode your source code using UTF-8 (as you're already doing), and you need to tell Perl you've done that using `use utf8;` (which needs doing).

[1] This means that `"<byte with value 0xFF>"` is equivalent to `"\xFF"`.
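The difference ikegami describes can be seen in a small sketch. It assumes the script file itself is saved as UTF-8; the variable names are made up for illustration. The same literal is read once under `use utf8;` and once without it, and each result is compared with the escaped forms from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so "ä" is stored as bytes C3 A4.
my ($chars, $bytes);

{
    use utf8;     # lexically scoped: literals here are decoded from UTF-8
    $chars = "behälter";
}
{
    no utf8;      # the default: each source byte becomes one character
    $bytes = "behälter";
}

printf "with use utf8:    length %d\n", length $chars;   # 8
printf "without use utf8: length %d\n", length $bytes;   # 9

print "with use utf8 the literal equals beh\\xE4lter\n"
    if $chars eq "beh\xE4lter";
print "without it the literal equals beh\\xC3\\xA4lter\n"
    if $bytes eq "beh\xC3\xA4lter";
```

Only ASCII is printed here, so the sketch does not depend on the terminal's encoding.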
Re: Perl encoding problem by derion (Sexton) on Dec 14, 2021 at 00:25 UTC
Thank you very much for your replies and comments, Dave, Ken and ikegami! "use utf8;" seems to do the job, which surprises me, as I probably misunderstood what it does. Besides the solution (hopefully) provided by all of you, I appreciate the side suggestions.

The '\xE4 was a typo: I cut off the rest, which reads '\xE4lter ...'

Posting SSCCE code is a bit of a challenge for me, as my problem is the data inserted into the database. I am always suspicious of what kind of data I have in the database in the end, even if it looks correct at first sight. At the moment it looks correct, and with other approaches I get errors. The source file imported is encoded in UTF-8.

`#!/usr/bin/perl use utf8; use open IN => ':encoding(UTF-8)'; open (DATA, "source.txt") || die "error opening file"; while (<DATA>) { my $add; if ($_ =~ /behälter/i) { $add->{Category} = 'Resttonerbehälter'; $add->{Description} = $_; } my $added = $DB->table('tablename')->add($add); } close DATA;`

The values of $add->{Category} and $add->{Description} are both inserted into a database with latin1_swedish_ci collation. I would have expected these values to cause problems with "use utf8;", but it seems they are latin1. I guess that if I wanted them to be UTF-8 I would have to encode them, and that the encoding of the strings has nothing to do with "use utf8;".
Re^2: Perl encoding problem by NERDVANA (Priest) on Dec 14, 2021 at 07:51 UTC
So then here's your next problem: "if I try to insert it into a latin1 mySQL database".

Any mysql from at least the last decade will support UTF8 mode. Assuming you have DBI there, the connection options should be at least: `connect($dsn, $user, $pass, { RaiseError => 1, AutoCommit => 1, mysql_enable_utf8 => 1, on_connect_call => 'set_strict_mode', })`

If this isn't an option and you really need mysql to be latin1, then you need to convert the incoming data to latin1 before you insert it: `use Encode qw(encode); $_ = encode('iso-8859-1', $_, 1) for values %$add;` Note that this dies if there is any character in the string that is outside latin1.

Any time you mess with encodings, you need to make a distinction between perl scalars that hold "unicode strings" and those that hold "byte sequences". It would be nice if Perl tracked this for you, but it does not. (Perl internally tracks whether it has wide characters in a scalar, but that is not the same as tracking the logical intent of the string.) Setting an encoding on a filehandle means you are receiving "unicode strings" from it. Before you pass those strings to an API that isn't unicode-aware, you need to choose what bytes they should become. You can manually convert to the bytes of your choice using the Encode module. Or, upgrade your DBI connection to be unicode-aware.

Edit: re-reading your initial post, I realize this could use more clarification. You were previously loading the UTF-8 file and converting it to latin1, then comparing it with a regex of bytes which were latin1 (because your source file was ascii with unofficial default latin1 upper bytes), which worked, then sending it to the database. On the new system the DB complained about \xE4 even though nothing in your script changed. Right? I don't think perl changed behavior, because it is highly backward compatible, but it might be that the database driver has new defaults and might in fact already expect utf8, and \xE4 is not a valid utf8 sequence.

Regardless of what actually changed and broke things, the advice we are giving you here is to do all your processing as unicode. Use unicode for the script itself, causing the regex to be unicode, decode all incoming data before processing it, then encode all data before emitting it. And, ideally, configure the mysql driver to do that encoding for you so you don't have to worry about it.
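The "decode on input, process as characters, encode on output" advice above can be sketched with the core Encode module. This is only an illustration: the sample strings and variable names are assumptions, and no database is involved. `LEAVE_SRC` is added so that Encode does not modify the source strings in place.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Die on bad input, and leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical input: the raw UTF-8 bytes as read without any I/O layer.
my $raw = "Resttonerbeh\xC3\xA4lter";

# 1. Decode: bytes become a character string.
my $text = decode('UTF-8', $raw, $check);

# 2. Process: the regex works on characters, so \x{E4} means "ä".
my $hit = $text =~ /beh\x{E4}lter/i;

# 3. Encode: choose the bytes for a latin1-only consumer.
my $out = encode('iso-8859-1', $text, $check);

printf "raw bytes: %d, characters: %d, latin1 bytes: %d\n",
    length $raw, length $text, length $out;   # 18, 17, 17
print "regex matched: ", ($hit ? "yes" : "no"), "\n";

# A character outside latin1 (here U+20AC EURO SIGN) makes encode die.
my $euro = "5 \x{20AC}";
my $ok   = eval { encode('iso-8859-1', $euro, $check); 1 };
my $err  = $ok ? '' : "$@";
print "encoding the euro sign to latin1: ", ($ok ? "ok" : "died"), "\n";
```

Wrapping the encode step in `eval` like this is one way to log and skip rows that cannot be represented in latin1, rather than aborting the whole import.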
Re^3: Perl encoding problem by derion (Sexton) on Dec 14, 2021 at 12:00 UTC
Thank you very much for your additional comments. The scripts have not changed, but the Perl version and the locale of the server have. These are the variables I thought to be the reason for my problems.

The migration to a mySQL UTF8 DB is one big target, but there is a bunch of things I will have to modify to achieve this. The migration of the server is one first step. The difficulties I am having now help me to prepare the next steps, hopefully. I have a couple of files with different encodings that all end up in the database, which is latin1_swedish_ci at the moment and will be some utf8 sometime. All migrated scripts and files worked fine so far, and the first problem occurred with the regex with the foreign characters in this script.

I now added: `foreach my $key (keys %$add) { $add->{$key} = encode('iso-8859-1', $add->{$key}, 1); }` which is what I would have expected to be done originally. The outcome of the encoding is that all lines with foreign characters no longer appear in the database. I get "Incorrect string value: '\xE4lter'", for example. Without the encoding all lines are added to the database and all of them look alright.

So, as far as I understand you, my first assumption about the encoding of the strings in $add is wrong and I should not just insert them into a latin1_swedish_ci DB. On the other hand, encoding them produces errors during the import into a database with latin1_swedish_ci collation. So I am having another problem I don't yet understand.
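One possible reading of the "Incorrect string value: '\xE4lter'" error follows NERDVANA's earlier guess that the driver may already expect utf8. The sketch below tests that reading, but the thread does not confirm it for derion's server. It checks whether the latin1-encoded and the UTF-8-encoded bytes of the same string are valid UTF-8; the helper `is_valid_utf8` is a hypothetical name introduced here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# A character string containing U+00E4.
my $text = "Resttonerbeh\x{E4}lter";

my $latin1 = encode('iso-8859-1', $text, $check);   # "ä" becomes byte E4
my $utf8   = encode('UTF-8',      $text, $check);   # "ä" becomes bytes C3 A4

# Hypothetical helper: true if the bytes decode cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, $check); 1 } ? 1 : 0;
}

printf "latin1 bytes valid as UTF-8: %s\n",
    is_valid_utf8($latin1) ? "yes" : "no";   # no: E4 needs continuation bytes
printf "UTF-8 bytes valid as UTF-8:  %s\n",
    is_valid_utf8($utf8) ? "yes" : "no";     # yes
```

If the connection does expect UTF-8, this would explain why the explicitly latin1-encoded values are rejected while the unencoded ones go through. Checking the connection character set on the server is the way to confirm it.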
Re^4: Perl encoding problem by NERDVANA (Priest) on Dec 15, 2021 at 14:17 UTC
Re^2: Perl encoding problem by kcott (Archbishop) on Dec 14, 2021 at 00:56 UTC
I'm glad to see `use utf8;` was sufficient to resolve your problem.

Unrelated to your initial problem, I thought that I'd point out a few issues with: `open (DATA, "source.txt") || die "error opening file";`

Use a lexical filehandle and the 3-argument form of open; the documentation for open has a number of examples and further discussion.

`DATA` is a package variable and a particularly bad choice anyway. See "perldata: Special Literals".

Hand-crafting `die` messages is tedious and easy to get wrong. `"error opening file"` gives no indication of the file that had the problem or what the problem was (for instance, non-existent file or insufficient privileges). Save yourself the effort of doing this task and let Perl report errors for you with the autodie pragma.

-- Ken
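Ken's three suggestions can be combined in a minimal sketch. To keep it self-contained, a temporary file stands in for `source.txt`; that file, its contents and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use autodie;                      # open/close die with a descriptive message
use File::Temp qw(tempfile);

# Hypothetical stand-in for source.txt: a small UTF-8 file created on the fly.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)';
print {$tmp_fh} "Resttonerbeh\x{E4}lter\nToner\n";
close $tmp_fh;

# Lexical filehandle, 3-argument open, decoding layer given explicitly.
open(my $in, '<:encoding(UTF-8)', $path);
my @hits;
while (my $line = <$in>) {
    chomp $line;
    push @hits, $line if $line =~ /beh\x{E4}lter/i;
}
close $in;
print scalar(@hits), " matching line(s)\n";   # 1 matching line(s)

# No hand-written die: autodie names the file and the reason.
my $missing = "$path.does-not-exist";
eval { open(my $fh, '<', $missing) };
my $err = "$@";
print "autodie reported: $err\n";
```

The message held in `$err` names the missing file and the system error, which is the information the hand-written `"error opening file"` left out.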
Re^3: Perl encoding problem by derion (Sexton) on Dec 14, 2021 at 13:50 UTC
The way I used open was sloppy; I should at least know better than that. autodie is new to me: I tried it and liked it. Thanks a lot for showing me something new.