in reply to Odd problems with UTF-8, regexps, and newer Perl versions

The code works for me without either the "use utf8" or "use encoding 'utf8'" statements. It works in 5.8.9, 5.10.1 and 5.12.1 (all three are installed on this system independent of each other).

A look at the doc page (perldoc utf8) shows the following:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without "use utf8;".
When UTF-8 becomes the standard source format, this pragma will effectively become a no-op.
The following functions are defined in the "utf8::" package by the Perl core. You do not need to say "use utf8" to use these and in fact you should not say that unless you really want to have UTF-8 source code.
So, try it without either "use" statement and see if the behaviour changes (for better or worse ;-)).
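For what it's worth, here is a minimal sketch of what the docs mean when they say the utf8:: functions work without the pragma. It is not the OP's script: the source is kept pure ASCII on purpose (the non-ASCII bytes are written as escapes), and the variable names are only there for illustration.

```perl
use strict;
use warnings;

# Deliberately no "use utf8": this file is pure ASCII, and the UTF-8
# bytes of the name from the thread are spelled out with \x escapes.
my $bytes = "B\xc3\xb6ck";

# A literal built from \x escapes below 0x100 is a plain byte string.
my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;

# utf8::decode works in place, so decode a copy.
my $chars       = $bytes;
my $decoded_ok  = utf8::decode($chars) ? 1 : 0;
my $flag_after  = utf8::is_utf8($chars) ? 1 : 0;

print "flag before decode: $flag_before\n";   # 0
print "decode succeeded:   $decoded_ok\n";    # 1
print "flag after decode:  $flag_after\n";    # 1
```

So the functions themselves are always available; the pragma only changes how string literals in the source file are read.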

Also, I noted (belatedly) that the rollback is to v5.6. This snippet from the docs may explain:

While some limited functionality towards this does exist as of Perl 5.8.0, that is more accidental than designed; use of Unicode for the said purposes is unsupported.

Re^2: Odd problems with UTF-8, regexps, and newer Perl versions
by almut (Canon) on Jun 04, 2010 at 23:55 UTC
    Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

    OTOH, it's likely that the OP's code is written in UTF-8, i.e. the string "Böck" is represented in the source file as the bytes 42 c3 b6 63 6b, and not as 42 f6 63 6b (Latin-1).

    Otherwise (with Latin-1), he would be getting "Malformed UTF-8 character (unexpected non-continuation byte 0x63, immediately after start byte 0xf6) at ./843208.pl line 7." with the two use directives enabled (which is different from what's shown).

    Also, without either or both of the use directives enabled, the variable would not have the utf8 flag on (i.e. no "yep, is UTF8" message), irrespective of whether it's encoded as UTF-8 or Latin-1.  This would of course fundamentally change how it's handled internally...
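To make the two byte representations concrete, here is a small sketch using the core Encode module. It starts from a character string written with an escape, so the file encoding doesn't matter; the variable names are mine, not the OP's.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string for the name from the thread; \x{f6} is o-umlaut.
my $name = "B\x{f6}ck";

# Encode the same four characters two ways and show the bytes as hex.
my $utf8_hex   = join ' ', unpack '(H2)*', encode('UTF-8',      $name);
my $latin1_hex = join ' ', unpack '(H2)*', encode('ISO-8859-1', $name);

print "UTF-8:   $utf8_hex\n";     # 42 c3 b6 63 6b
print "Latin-1: $latin1_hex\n";   # 42 f6 63 6b

# Latin-1 bytes are not well-formed UTF-8 (0xf6 is a start byte that is
# followed by the non-continuation byte 0x63), so decoding them fails.
my $latin1_bytes = encode('ISO-8859-1', $name);
my $latin1_ok    = utf8::decode($latin1_bytes) ? 1 : 0;
print "Latin-1 bytes decode as UTF-8: $latin1_ok\n";   # 0
```

That failed decode is the same situation the "Malformed UTF-8 character" message describes for a Latin-1 source file read under "use utf8".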

      Yes, as almut pointed out, my source is in UTF-8, so I do need the pragma.

      But the plot thickens:

      Going back to my original code, switching the "use encoding" for "use utf8" did not fix things. The original regular expression was much more complex, and it still dies. I've verified that even a tiny bit more complex RE will still fail even using "use utf8". It did seem a little "magical" that simply removing what should have been a harmless pragma made things work...

      The modified example follows; I ran on 5.12.1. What am I missing? Your sage help is much appreciated!

          #!/usr/bin/perl
          use strict vars;
          use utf8;
          binmode STDOUT, ":utf8";

          my $e = "Böck";
          if (utf8::is_utf8($e)) {
              print "yep, is UTF8: $e\n";
          }

          # this succeeds (failed before with use encoding 'utf8', unknown why)
          if ($e=~ m/.*?[x]$/) { print "matched simple\n"; }
          print "success with simple\n";

          # these die
          if ($e=~ m/.*?\p{Space}$/) { print "matched medium\n"; }
          print "success with medium\n";
          if ($e=~ m/.*?[xyz]$/) { print "matched medium\n"; }
          print "success with medium\n";

          # the original, full expression. Naturally, this dies.
          if ($e =~ m/(.*?)[,\p{isSpace}]+((?:\p{isAlpha}[\p{isSpace}\.]{1,2})+)\p{isSpace}*$/) { print "matched complex\n"; }
          print "success with complex\n";

        I can replicate the problem, but I don't have a solution.

        One other thing that use encoding 'utf8' changes is how byte strings are interpreted when implicitly upgraded, i.e. they are then treated as UTF-8 encoded strings, while without the pragma, they are treated as Latin-1 strings:

            use utf8;
            #use encoding 'utf8';
            use Devel::Peek;

            my $s = "ö";          # character string
            Dump $s;

            utf8::encode($s);     # byte string c3 b6 (UTF-8 encoded ö); utf8 flag off
            Dump $s;

            my $s2 = $s . "ö";    # implicit upgrade of $s
            Dump $s2;

        Default behavior:

            SV = PV(0x750b78) at 0x777c70
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK,UTF8)
              PV = 0x7722d0 "\303\266"\0 [UTF8 "\x{f6}"]
              CUR = 2
              LEN = 8
            SV = PV(0x750b78) at 0x777c70
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK)
              PV = 0x7722d0 "\303\266"\0
              CUR = 2
              LEN = 8
            SV = PV(0x751398) at 0x777d00
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK,UTF8)
              PV = 0x787010 "\303\203\302\266\303\266"\0 [UTF8 "\x{c3}\x{b6}\x{f6}"]
              CUR = 6                                           ^^^^^^^^^^^^
              LEN = 8

        With use encoding 'utf8' uncommented:

            SV = PV(0x750b78) at 0x777c88
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK,UTF8)
              PV = 0x7722d0 "\303\266"\0 [UTF8 "\x{f6}"]
              CUR = 2
              LEN = 8
            SV = PV(0x750b78) at 0x777c88
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK)
              PV = 0x7722d0 "\303\266"\0
              CUR = 2
              LEN = 8
            SV = PV(0x860de8) at 0x777cd0
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK,UTF8)
              PV = 0x78fa30 "\303\266\303\266"\0 [UTF8 "\x{f6}\x{f6}"]
              CUR = 4                                   ^^^^^^
              LEN = 8

        As you can see, the default behavior is to treat the byte string as Latin-1 when upgrading, i.e. the two bytes (c3 b6) that UTF-8-encode the "ö" character (\x{f6}) are being decoded into the two separate characters \x{c3} and \x{b6}.  Not so, when the encoding pragma is in effect: now they're treated as UTF-8, so they end up as \x{f6}.
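The default (Latin-1) upgrade can also be seen without Devel::Peek, by printing code points. This is only a sketch of the default behaviour; it does not load the encoding pragma, and it uses utf8::upgrade on an escaped literal as a stand-in (my assumption) for the flagged string that "use utf8" would produce. The last part shows an explicit decode as one way to get the joined result that the pragma produced implicitly.

```perl
use strict;
use warnings;

# One character, o-umlaut, forced into the flagged internal form.
my $o_umlaut = "\x{f6}";
utf8::upgrade($o_umlaut);

# Encode a copy: now the two bytes c3 b6, utf8 flag off.
my $bytes = $o_umlaut;
utf8::encode($bytes);

# Concatenating bytes with a character string upgrades the bytes
# as if they were Latin-1, giving three characters.
my $joined      = $bytes . $o_umlaut;
my $default_hex = join ' ', map { sprintf('%x', ord($_)) } split //, $joined;

# Decoding explicitly first gives the two characters one would expect.
my $decoded = $bytes;
utf8::decode($decoded);
my $explicit_hex = join ' ', map { sprintf('%x', ord($_)) } split //, $decoded . $o_umlaut;

print "implicit upgrade: $default_hex\n";    # c3 b6 f6
print "explicit decode:  $explicit_hex\n";   # f6 f6
```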

        But I have no hypothesis how this difference would come into play in your regex matching...

        I'd say this is a bug.  Matching a valid Unicode character string against a regex should not make the program die.

Re^2: Odd problems with UTF-8, regexps, and newer Perl versions
by ikegami (Patriarch) on Jun 05, 2010 at 03:24 UTC

    The code works for me without either the "use utf8"

    Except, say, if you took the length of the variable.
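A quick sketch of that point, assuming the source file is UTF-8. To keep the example independent of the file encoding, the bytes are written as escapes, and an explicit utf8::decode stands in for what "use utf8" would do to the literal.

```perl
use strict;
use warnings;

# What Perl sees for the UTF-8 literal when "use utf8" is absent: five bytes.
my $as_bytes = "B\xc3\xb6ck";

# What it would see with "use utf8": four characters.
my $as_chars = $as_bytes;
utf8::decode($as_chars);

print "length without decoding: ", length($as_bytes), "\n";   # 5
print "length after decoding:   ", length($as_chars), "\n";   # 4

# The same difference shows up in regexes: "." matches one character,
# which is one byte in the first string and the whole o-umlaut in the second.
my $bytes_match = ($as_bytes =~ /^B.ck$/) ? 1 : 0;   # 0
my $chars_match = ($as_chars =~ /^B.ck$/) ? 1 : 0;   # 1
print "bytes match /^B.ck\$/: $bytes_match\n";
print "chars match /^B.ck\$/: $chars_match\n";
```

So code can look like it "works" without the pragma as long as nothing counts or matches individual characters.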