Re^3: Odd problems with UTF-8, regexps, and newer Perl versions

Yes, as almut pointed out, my source is in UTF-8, so I do need the pragma.

But the plot thickens:

Going back to my original code, replacing "use encoding" with "use utf8" did not fix things. The original regular expression was much more complex, and it still dies. I've verified that even a slightly more complex RE still fails under "use utf8". It did seem a little "magical" that simply removing what should have been a harmless pragma made things work...

The modified example follows; I ran it on 5.12.1. What am I missing? Your sage help is much appreciated!

#!/usr/bin/perl 

use strict vars;
use utf8;
binmode STDOUT, ":utf8";

my $e = "Böck";

if (utf8::is_utf8($e)) { print "yep, is UTF8: $e\n"; }

# this succeeds (failed before with use encoding 'utf8', unknown why)
if ($e=~ m/.*?[x]$/) { print "matched simple\n"; }
print "success with simple\n";

# these die 
if ($e=~ m/.*?\p{Space}$/) { print "matched medium\n"; }        
print "success with medium\n";
if ($e=~ m/.*?[xyz]$/) { print "matched medium\n"; }
print "success with medium\n";

# the original, full expression. Naturally, this dies.
if ($e =~ m/(.*?)[,\p{isSpace}]+((?:\p{isAlpha}[\p{isSpace}\.]{1,2})+)\p{isSpace}*$/) { print "matched complex\n"; }
print "success with complex\n";
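
To see which pattern dies and with what message, each match can be run inside eval. This is only a diagnostic sketch, not a fix: the pattern names and the %result hash are made up for illustration, and the test string is written with an escape plus utf8::upgrade instead of "use utf8" with a literal "ö", on the assumption that the two give the same character string.

```perl
use strict;
use warnings;

# "Böck" as a character string; the escape avoids depending on the
# encoding of this source file (assumption: equivalent to the literal
# under "use utf8").
my $e = "B\x{f6}ck";
utf8::upgrade($e);    # make sure the internal UTF8 flag is on

# Hypothetical labels paired with the patterns from the example above.
my @patterns = (
    [ simple  => '.*?[x]$' ],
    [ space   => '.*?\p{Space}$' ],
    [ class   => '.*?[xyz]$' ],
    [ complex => '(.*?)[,\p{isSpace}]+((?:\p{isAlpha}[\p{isSpace}\.]{1,2})+)\p{isSpace}*$' ],
);

my %result;
for my $p (@patterns) {
    my ($name, $src) = @$p;

    # Compile and match inside eval so a die in either step is caught.
    my $matched = eval {
        my $re = qr/$src/;
        $e =~ $re ? 1 : 0;
    };

    if (defined $matched) {
        $result{$name} = $matched;
        print "$name: ", ($matched ? "matched" : "no match"), "\n";
    }
    else {
        $result{$name} = undef;    # undef marks a pattern that died
        print "$name: died: $@";
    }
}
```

On a perl without the problem, none of the four patterns should die, and none should match, since "Böck" contains no "x", "y", "z", comma, or whitespace.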

Replies are listed 'Best First'.

Re^4: Odd problems with UTF-8, regexps, and newer Perl versions
by almut (Canon) on Jun 05, 2010 at 03:02 UTC

I can replicate the problem, but I don't have a solution.

One other thing that use encoding 'utf8' changes is how byte strings are interpreted when implicitly upgraded, i.e. they are then treated as UTF-8 encoded strings, while without the pragma, they are treated as Latin-1 strings:

use utf8;
#use encoding 'utf8';
use Devel::Peek;

my $s = "ö";         # character string
Dump $s;

utf8::encode($s);    # byte string c3 b6 (UTF-8 encoded ö); utf8 flag off
Dump $s;

my $s2 = $s . "ö";   # implicit upgrade of $s
Dump $s2;

Default behavior:

SV = PV(0x750b78) at 0x777c70
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7722d0 "\303\266"\0 [UTF8 "\x{f6}"]
  CUR = 2
  LEN = 8
SV = PV(0x750b78) at 0x777c70
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7722d0 "\303\266"\0
  CUR = 2
  LEN = 8
SV = PV(0x751398) at 0x777d00
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x787010 "\303\203\302\266\303\266"\0 [UTF8 "\x{c3}\x{b6}\x{f6}"]
  CUR = 6                                           ^^^^^^^^^^^^
  LEN = 8

With use encoding 'utf8' uncommented:

SV = PV(0x750b78) at 0x777c88
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7722d0 "\303\266"\0 [UTF8 "\x{f6}"]
  CUR = 2
  LEN = 8
SV = PV(0x750b78) at 0x777c88
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7722d0 "\303\266"\0
  CUR = 2
  LEN = 8
SV = PV(0x860de8) at 0x777cd0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x78fa30 "\303\266\303\266"\0 [UTF8 "\x{f6}\x{f6}"]
  CUR = 4                                   ^^^^^^
  LEN = 8

As you can see, the default behavior is to treat the byte string as Latin-1 when upgrading, i.e. the two bytes (c3 b6) that UTF-8-encode the "ö" character (\x{f6}) are decoded into the two separate characters \x{c3} and \x{b6}. Not so when the encoding pragma is in effect: the bytes are then treated as UTF-8, so they end up as \x{f6}.
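
If the UTF-8 interpretation is what you want without the encoding pragma, one option is to decode the byte string explicitly before it meets a character string. A minimal sketch, assuming the Encode module is available; the variable names are only for illustration, and the "ö" is written as \x{f6} plus utf8::upgrade so the result does not depend on how the file is saved:

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+00F6 (ö) as a character string with the internal UTF8 flag on.
my $char = "\x{f6}";
utf8::upgrade($char);

# Byte string c3 b6 (UTF-8 encoded ö), UTF8 flag off.
my $bytes = $char;
utf8::encode($bytes);

# Default behavior: the byte string is upgraded as Latin-1 when it is
# concatenated with a character string.
my $implicit = $bytes . $char;

# Explicit decode first, so c3 b6 is read as one UTF-8 encoded character.
my $explicit = decode('UTF-8', $bytes) . $char;

# Print code points in ASCII so no output layer is needed.
for my $pair ([ implicit => $implicit ], [ explicit => $explicit ]) {
    my ($label, $str) = @$pair;
    printf "%s: %d chars (%s)\n", $label, length($str),
        join(' ', map { sprintf 'U+%04X', ord } split //, $str);
}
```

The implicit case gives three characters (U+00C3 U+00B6 U+00F6), matching the default Dump output above; the explicit decode gives two (U+00F6 U+00F6), matching the output with the pragma in effect.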

But I have no hypothesis how this difference would come into play in your regex matching...

I'd say this is a bug. Matching a valid Unicode character string against a regex should not make the program die.
