derion has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to migrate some scripts from one server to another, and there are a couple of differences.
The old server has Perl 5.10.1 and locale settings LC_CTYPE="C".
The new server has 5.30.0 and locale settings LC_CTYPE="C.UTF-8".
On the old server I was able to parse a UTF-8 text file by opening it and looping over every line like this:
    my $category;
    my $decoded_text = decode('UTF-8', $_);
    #my $latin1_html = encode('iso-latin-1', $decoded_text );
    my $latin1_html = encode('iso-8859-1', $decoded_text );
    if ($latin1_html =~ /behälter/i) {
        $category = 'Behälter';
    }
This does not seem to work in the new environment.
I tried to modify some things and ended up with:
    use open IN => ':encoding(UTF-8)';
    use open OUT => ':encoding(iso-8859-1)';
before opening the file and
    my $category;
    if ($_ =~ /behälter/i) {
        $category = 'Behälter';
    }
This works at first sight, but $category causes problems and seems to be in a different encoding than $_.
At the moment I presume $_ is ISO-8859-1 and $category something else. E.g. if I try to insert it into a latin1 mySQL database it throws the error "Incorrect string value: '\xE4".
The perl script file is an ANSI file.
If I run the script with a UTF-8 encoded file, I can do the following:
    my $category;
    if ($_ =~ /beh\xE4lter/i) {
        $category = 'Behälter';
    }

This works but it seems to be more a patch to a symptom than a cure to the problem. I really would like to understand what kind of mistake I am making and what approach I could take to handle file parsing, string modification and storing in files or databases the right way in the new environment. Thank you very much for your comments.

Re: Perl encoding problem
by dave_the_m (Monsignor) on Dec 13, 2021 at 22:14 UTC
    If you want to use literal utf8 characters in your source code (literal strings, regexes etc), then you need to tell the perl interpreter that the source code should be treated as utf8 by adding 'use utf8;' at the top of your script.

    Dave.

Re: Perl encoding problem
by kcott (Archbishop) on Dec 13, 2021 at 22:59 UTC

    G'day derion,

    My first guess is that if /behälter/i is causing problems, but /beh\xE4lter/i is not, then using the utf8 pragma might be all you need.

    Having said that, you've only provided code fragments. Parts that you've omitted may be important, e.g. how you call the open function. Please provide an SSCCE that we can run: you should keep this as short as possible while still showing the problem; also, please provide a short input file (probably only needs to be a few lines long).

    The error you show, "Incorrect string value: '\xE4", contains an unexpected apostrophe: perhaps the actual error message, a typo, an SQL problem, or something else. Please paste verbatim program output within <code>...</code> tags, rather than typing by hand.

    I generally prefer "\x{NN}" to "\xNN", as it removes any possible ambiguity (especially if NN is followed by other digits). I don't see a problem with that here, but it could be elsewhere: a little defensive programming never hurts.
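
    A small sketch of the ambiguity mentioned above, assuming nothing beyond core Perl: the unbraced form consumes at most two hex digits, so any further digits become ordinary characters.

```perl
use strict;
use warnings;

# "\x" without braces consumes at most two hex digits.
my $unbraced = "\x1234";      # chr(0x12) followed by the characters "3" and "4"
my $braced   = "\x{1234}";    # the single character U+1234

printf "unbraced: %d characters\n", length($unbraced);   # 3
printf "braced:   %d character\n",  length($braced);     # 1
```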

    And just a heads-up, "U+00E4 LATIN SMALL LETTER A WITH DIAERESIS" (ä) canonically decomposes into U+0061 (a) and U+0308 (¨). Again, I don't see that as an issue here, but maybe worth knowing about. See PDF "Unicode Code Chart: 0080 - 00ff".
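
    If decomposed input ever does turn up, one possible way to handle it is to normalize both sides before comparing. This is only a sketch using the core Unicode::Normalize module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{E4}";      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
my $decomposed  = "a\x{308}";    # U+0061 followed by U+0308 COMBINING DIAERESIS

# A plain string comparison looks at code points, so these differ.
my $raw_equal = $precomposed eq $decomposed ? "yes" : "no";

# After normalizing both to the same form they compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? "yes" : "no";

print "equal without normalization: $raw_equal\n";    # no
print "equal after NFC:             $nfc_equal\n";    # yes
print "length after NFD:            ", length(NFD($precomposed)), "\n";    # 2
```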

    — Ken

Re: Perl encoding problem
by ikegami (Patriarch) on Dec 13, 2021 at 23:24 UTC

    You need to add use utf8;.


    Unless you use use utf8;, Perl expects your code to be encoded using ASCII. Literals are 8-bit clean.[1]

    I'm guessing you didn't use use utf8;. If that's the case, you can't possibly have /behälter/ since ä isn't found in the ASCII character set. You're actually giving Perl something equivalent to /beh\xC3\xA4lter/.

    As you indicated, the proper solution is to give /beh\xE4lter/ or equivalent. For /behälter/ to be equivalent, you need to encode your source code using UTF-8 (as you're already doing), and you need to tell Perl you've done that using use utf8; (which needs doing).


    1. This means that "<byte with value 0xFF>" is equivalent to "\xFF".
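
    To make the difference concrete, here is a minimal sketch that spells the UTF-8 bytes out explicitly, so it behaves the same no matter how the script file itself is saved. The sample word and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Behälter" as UTF-8 bytes, the way a raw (undecoded) read would deliver it.
my $bytes = "Beh\xC3\xA4lter";

# The same word as a string of characters.
my $chars = decode('UTF-8', $bytes);

my $char_pattern = qr/beh\xE4lter/i;        # what /behälter/i means under "use utf8"
my $byte_pattern = qr/beh\xC3\xA4lter/i;    # what a UTF-8 source gives without it

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);    # 9 and 8
print "decoded text vs character pattern: ", ($chars =~ $char_pattern ? "match" : "no match"), "\n";
print "raw bytes    vs character pattern: ", ($bytes =~ $char_pattern ? "match" : "no match"), "\n";
print "raw bytes    vs byte pattern:      ", ($bytes =~ $byte_pattern ? "match" : "no match"), "\n";
```

    Only the pairs in which both sides use the same representation (characters with characters, bytes with bytes) match.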
Re: Perl encoding problem
by derion (Sexton) on Dec 14, 2021 at 00:25 UTC
    Thank you very much for your replies / comments Dave, Ken and ikegami!
    "use utf8;" seems to do the job which surprises me as I probably misunderstood what it does.
    Besides the (hopefully) working solution you all provided, I appreciate the side suggestions.
    '\xE4 was a typo: I cut off the rest, which was '\xE4lter ...'

    Posting SSCCE code is a bit of a challenge for me as my problem is the data inserted into the database.
    I am always suspicious of what kind of data I have in the database in the end,
    even if it looks correct at first sight. At the moment it looks correct and with other approaches I get errors.
    The source file imported is encoded in UTF-8.
    #!/usr/bin/perl
    use utf8;
    use open IN => ':encoding(UTF-8)';

    open (DATA, "source.txt") || die "error opening file";
    while (<DATA>) {
        my $add;
        if ($_ =~ /behälter/i) {
            $add->{Category} = 'Resttonerbehälter';
            $add->{Description} = $_;
        }
        my $added = $DB->table('tablename')->add($add);
    }
    close DATA;
    The values of $add->{Category} and $add->{Description} are both inserted into a database with latin1_swedish_ci collation. I would have expected these values to cause problems with "use utf8;", but they seem to be latin1. I guess that if I wanted them to be UTF-8 I would have to encode them, and that the encoding of the strings has nothing to do with "use utf8;".
      So then here's your next problem:

      if I try to insert it into a latin1 mySQL database

      Any mysql from at least the last decade will support UTF8 mode. Assuming you have DBI there, the connection options should be at least:

      connect($dsn, $user, $pass, {
          RaiseError        => 1,
          AutoCommit        => 1,
          mysql_enable_utf8 => 1,
          on_connect_call   => 'set_strict_mode',
      })
      If this isn't an option and you really need mysql to be latin1, then you need to convert the incoming data to latin1 before you insert it.
      use Encode qw(encode);
      $_ = encode('iso-8859-1', $_, 1) for values %$add;

      Note that that dies if there is any character in the string that is outside latin1.
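
      A quick sketch of that failure mode, using the named constant FB_CROAK (which has the value 1) in place of the bare 1. LEAVE_SRC is added here only so the sample variables are not modified in place; the sample strings are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

my $fits     = "Beh\x{E4}lter";        # every character exists in ISO-8859-1
my $too_wide = "Preis: 5 \x{20AC}";    # the euro sign does not exist in ISO-8859-1

# Succeeds and returns 8 bytes.
my $latin1 = encode('iso-8859-1', $fits, FB_CROAK | LEAVE_SRC);

# Dies, so wrap the call in eval to inspect the error.
my $survived = eval { encode('iso-8859-1', $too_wide, FB_CROAK | LEAVE_SRC); 1 };

printf "encoded %d bytes\n", length($latin1);
print "second encode died: $@" unless $survived;
```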

      Any time you mess with encodings, you need to make a distinction of which perl scalars are holding "unicode strings" vs. which are holding "byte sequences". It would be nice if Perl tracked this for you, but it does not. (perl internally tracks whether it has wide characters in a scalar, but that is not the same as tracking the logical intent of the string) Setting an encoding on a filehandle means you are receiving "unicode strings" from it. Before you pass those strings to an API that isn't unicode-aware, you need to choose what bytes they should become. You can manually convert to the bytes of your choice using the Encode module. Or, upgrade your DBI connection to be unicode aware.
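
      As a rough illustration of the decode, process, encode pattern (a sketch with a made-up input string, not your actual data): string operations such as uc only treat ä as a letter once the bytes have been decoded into characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes as they would arrive from a filehandle without an encoding layer.
my $input_bytes = "Resttonerbeh\xC3\xA4lter";

# 1. Decode: bytes become a string of characters.
my $text = decode('UTF-8', $input_bytes);

# 2. Process: work on characters, not bytes.
my $upper_text  = uc $text;           # the umlaut becomes U+00C4
my $upper_bytes = uc $input_bytes;    # the two UTF-8 bytes are left untouched

# 3. Encode: choose the bytes for a consumer that is not Unicode-aware.
my $output_bytes = encode('iso-8859-1', $upper_text);

printf "input: %d bytes, text: %d characters, output: %d bytes\n",
    length($input_bytes), length($text), length($output_bytes);
```

      Here the Latin-1 output ends up with one byte per character, while the UTF-8 input needed two bytes for the ä.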

      Edit: re-reading your initial post, I realize this could use more clarification. You were previously loading the UTF-8 file and converting it to latin1, then comparing it with a regex of bytes which were latin1 (because your source file was ASCII with unofficial default latin1 upper bytes), which worked, then sending it to the database. On the new system the DB complained about \xE4 even though nothing in your script changed, right? I don't think perl changed behavior, because it is highly backward compatible, but it might be that the database driver has new defaults and in fact already expects utf8, and \xE4 is not a valid utf8 sequence.

      Regardless of what actually changed and broke things, the advice we are giving you here is to do all your processing as unicode. Use unicode for the script itself, causing the regex to be unicode, decode all incoming data before processing it, then encode all data before emitting it. And, ideally configure the mysql driver to do that encoding for you so you don't have to worry about it.

        Thank you very much for your additional comments.

        The scripts have not changed, but the Perl version and the locale of the server have. These are the variables I thought to be the reason for my problems.

        The migration to a mySQL UTF8 DB is one big target but there is a bunch of things I will have to modify to achieve this.
        The migration of the server is one first step.
        The difficulties I am having now help me to prepare the next steps, hopefully.

        I have a couple of files with different encodings that all end up in the database,
        which is latin1_swedish_ci at the moment and will be some form of utf8 at some point.
        All migrated scripts and files have worked fine so far; the first problem occurred with the regex containing non-ASCII characters in this script.

        I now added:
        foreach my $key (keys %$add) {
            $add->{$key} = encode('iso-8859-1', $add->{$key}, 1);
        }

        which is what I would have expected to be done originally.
        The outcome of the encoding is that all lines with foreign characters do not appear in the database anymore.
        I get "Incorrect string value: '\xE4lter'" for example.
        Without the encoding all lines are added in the database and all of them look alright.

        So, as far as I understand you, my first assumption about the encoding of the strings in $add is wrong and I should not just insert them into a latin1_swedish_ci DB.
        On the other hand, encoding them produces errors during the import into a database with latin1_swedish_ci collation.
        So I have another problem that I don't understand yet.

      I'm glad to see use utf8; was sufficient to resolve your problem.

      Unrelated to your initial problem, I thought that I'd point out a few issues with:

      open (DATA, "source.txt") || die "error opening file";
      • Use a lexical filehandle and the 3-argument form of open. That documentation has a number of examples and further discussion.
      • DATA is a package variable and a particularly bad choice anyway. See "perldata: Special Literals".
      • Hand-crafting die messages is tedious and easy to get wrong. "error opening file" gives no indication of the file that had the problem or what the problem was (for instance, non-existent file or insufficient privileges). Save yourself the effort of doing this task and let Perl report errors for you with the autodie pragma.
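
      Putting those three points together, a minimal sketch might look like the following. The temporary file is only a stand-in for source.txt so that the example is self-contained, and the regex uses \x{E4} so the result does not depend on how this snippet is saved.

```perl
use strict;
use warnings;
use autodie;                       # open and close now die with a descriptive message
use File::Temp qw(tempfile);

# Stand-in for source.txt: one line of UTF-8 encoded text.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Resttonerbeh\xC3\xA4lter\n";
close $tmp_fh;

# Lexical filehandle, 3-argument open, decoding layer given in the mode.
open my $in_fh, '<:encoding(UTF-8)', $path;
my @matches;
while (my $line = <$in_fh>) {
    chomp $line;
    push @matches, $line if $line =~ /beh\x{E4}lter/i;
}
close $in_fh;
print "matching lines: ", scalar(@matches), "\n";

# On failure, autodie reports the file name and the reason on its own.
my $opened = eval { open my $missing_fh, '<', "$path.does-not-exist"; 1 };
print "open failed: $@\n" unless $opened;
```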

      — Ken

        The way I used open was sloppy; I should at least know better than that.
        autodie is new to me. I tried it and liked it, thanks a lot for the pointer.