Odd problems with UTF-8, regexps, and newer Perl versions

ablegrape has asked for the wisdom of the Perl Monks concerning the following question:

Silly me, I upgraded my development Mac to Snow Leopard, and a bunch of my UTF-8 code broke with the newer version of Perl (5.10.0). I've isolated the problem to a strange behavior with regular expressions, have RT'ed the FM, and can't find an explanation in any of the expected behaviors of newer Perl. Thinking this might be a bug, I've rolled forward to 5.12.0/1, but the problem persists. I could try to roll back to the older (5.6?) Perl, but would prefer to understand what's going on, and fix my code, if possible.

Here's a simple test case. The string in question is valid UTF-8 as far as I can tell (same problem persists when reading from a UTF-8 file), and works with most regular expressions, just not a very specific combination of them.

#!/usr/bin/perl

use strict vars;
use utf8;
use encoding 'utf8';

my $e = "Böck";

if (utf8::is_utf8($e)) { print "yep, is UTF8\n"; }

# this fails with: Malformed UTF-8 character
# seems to require the combination of a minimum-length wildcard match
# + non-matching character class. For example:
#       m/.*?[k]$/      succeeds
#       m/.*?x$/      succeeds
#       m/.*[x]$/      succeeds

if ($e=~ m/.*?[x]$/) { print "matched\n"; }

print "success with $e\n";

The program dies thus:

% ./test.pl
yep, is UTF8
Malformed UTF-8 character (fatal) at ./test.pl line 17.

Have tried lots of things, to no avail. Perhaps some monk more adept than I will have a clue as to how to approach this?

Many thanks!


Replies are listed 'Best First'.
Re: Odd problems with UTF-8, regexps, and newer Perl versions by almut (Canon) on Jun 04, 2010 at 21:17 UTC
For me the problem goes away when I comment out the `use encoding 'utf8'` line (tested with 5.10.1). Why do you think you need it? `use utf8` already tells Perl that the script source is in UTF-8 (and you can always use binmode to change layers for STDIN and STDOUT).
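
A minimal sketch of the setup suggested above, assuming the script file itself is saved as UTF-8. It keeps `use utf8` for the source literal and sets the output layer with binmode instead of `use encoding`. The `:encoding(UTF-8)` layer is my choice here; `:utf8` is the laxer variant. Whether the pattern match misbehaves still depends on the Perl version in use, as later replies in this thread show.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # the source file is saved as UTF-8

# Encode output explicitly instead of relying on "use encoding".
binmode STDOUT, ':encoding(UTF-8)';

my $e = "Böck";

# With "use utf8" the literal is decoded into characters: B, ö, c, k.
print "length: ", length($e), "\n";

# Non-greedy wildcard plus a non-matching character class,
# the combination from the original test case.
if ($e =~ m/.*?[x]$/) { print "matched\n"; }

print "success with $e\n";
```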
Re^2: Odd problems with UTF-8, regexps, and newer Perl versions by choroba (Cardinal) on Jun 04, 2010 at 21:33 UTC
I can replicate the problem in perl 5.10.0 too, but not in 5.8.8. almut's solution solves it.
Re^2: Odd problems with UTF-8, regexps, and newer Perl versions by ablegrape (Initiate) on Jun 04, 2010 at 23:36 UTC
Thanks for the quick reply. I tried that, too, and while the regexp then works, the behavior changes. With only 'use utf8':

% ./test.pl
yep, is UTF8
success with B?ck

I see, "use encoding" also sets binmode on STDIN and STDOUT, so with just 'use utf8' I need to add the binmode explicitly. With use utf8 plus "binmode STDOUT, ':utf8'":

% ./test.pl
yep, is UTF8
success with Böck

(My, Perl's unicode handling is complicated.) Now to see if I can apply this learning successfully to the original application, which is far more complex...
Re^3: Odd problems with UTF-8, regexps, and newer Perl versions by moritz (Cardinal) on Jun 05, 2010 at 06:15 UTC
> I see, "use encoding" also sets binmode on STDIN and STDOUT, so that while just using 'use' I need to explicitly add the binmode.

You can also use the open pragma for that, and also for future calls to open.
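
A short sketch of the open pragma mentioned above, assuming a UTF-8 source file. `use open qw(:std :utf8)` sets the default layers for later open() calls in the same lexical scope and also applies them to STDIN, STDOUT and STDERR. The temporary file and the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # source literals are UTF-8
use open qw(:std :utf8);        # default layers for open() and the std handles
use File::Temp qw(tempfile);

my $name = "Böck";

# tempfile() opens its own handle inside File::Temp, outside the scope
# of the pragma, so only the file name is used here.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No explicit layer: the default from "use open" applies.
open my $out, '>', $path or die "cannot write $path: $!";
print {$out} "$name\n";
close $out;

open my $in, '<', $path or die "cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# The line comes back as characters, not as raw bytes.
print "read back ", length($line), " characters: $line\n";
```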
Re: Odd problems with UTF-8, regexps, and newer Perl versions by proceng (Scribe) on Jun 04, 2010 at 22:52 UTC
The code works for me without either the "use utf8" or "use encoding 'utf8'" statements. It works in 5.8.9, 5.10.1 and 5.12.1 (all three are installed on this system independent of each other).

A look at the doc page (perldoc utf8) shows the following:

> Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without "use utf8;". When UTF-8 becomes the standard source format, this pragma will effectively become a no-op. The following functions are defined in the "utf8::" package by the Perl core. You do not need to say "use utf8" to use these and in fact you should not say that unless you really want to have UTF-8 source code.

So, try it without either "use" statement and see if the behaviour changes (for better or worse ;-)).

Also, I noted (belatedly) that the rollback you mention is to v5.6. This snip from the docs may explain:

> While some limited functionality towards this does exist as of Perl 5.8.0, that is more accidental than designed; use of Unicode for the said purposes is unsupported.
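
To illustrate the quoted point that the utf8:: functions work without the pragma, here is a small sketch. The string is written with byte escapes so it does not depend on how the file is saved; that is my assumption for the example, not something from the original test case.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Deliberately no "use utf8": the utf8:: functions are built in.

my $s = "B\xc3\xb6ck";          # the UTF-8 bytes of "Böck", as escapes

my $flag_before = utf8::is_utf8($s) ? 1 : 0;

# utf8::decode converts the string in place and returns false
# if the bytes are not well-formed UTF-8.
my $ok = utf8::decode($s);

my $flag_after = utf8::is_utf8($s) ? 1 : 0;

print "flag before decode: $flag_before\n";
print "decode succeeded:   ", ($ok ? "yes" : "no"), "\n";
print "flag after decode:  $flag_after\n";
```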
Re^2: Odd problems with UTF-8, regexps, and newer Perl versions by almut (Canon) on Jun 04, 2010 at 23:55 UTC
> Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

OTOH, it's likely that the OP's code is written in UTF-8, i.e. the string `"Böck"` is represented in the source file as the bytes `42 c3 b6 63 6b`, and not as `42 f6 63 6b` (Latin-1). Otherwise (with Latin-1), he would be getting `"Malformed UTF-8 character (unexpected non-continuation byte 0x63, immediately after start byte 0xf6) at ./843208.pl line 7."` with the two `use` directives enabled (which is different from what's shown).

Also, without either or both of the `use` directives enabled, the variable would not have the utf8 flag on (i.e. no `"yep, is UTF8"` message), irrespective of whether it's encoded as UTF-8 or Latin-1. This would of course fundamentally change how it's handled internally...
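
The two byte sequences can be checked with a short sketch using the core Encode module. The helper name `hex_bytes` is made up for this example, and the string uses the `\x{f6}` escape for ö so the result does not depend on the encoding of the script file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{f6}" is U+00F6 (ö).
my $name = "B\x{f6}ck";

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

my $as_utf8   = hex_bytes(encode('UTF-8',      $name));
my $as_latin1 = hex_bytes(encode('ISO-8859-1', $name));

print "UTF-8:   $as_utf8\n";     # 42 c3 b6 63 6b
print "Latin-1: $as_latin1\n";   # 42 f6 63 6b
```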
Re^3: Odd problems with UTF-8, regexps, and newer Perl versions by ablegrape (Initiate) on Jun 05, 2010 at 00:41 UTC
Yes, as almut pointed out, my source is in UTF-8, so I do need the pragma.

But the plot thickens: going back to my original code, dropping "use encoding" and keeping only "use utf8" did not fix things. The original regular expression was much more complex, and it still dies. I've verified that an RE that is even slightly more complex will still fail with only "use utf8". It did seem a little "magical" that simply removing what should have been a harmless pragma made things work...

The modified example follows; I ran it on 5.12.1. What am I missing? Your sage help is much appreciated!

#!/usr/bin/perl

use strict vars;
use utf8;
binmode STDOUT, ":utf8";

my $e = "Böck";

if (utf8::is_utf8($e)) { print "yep, is UTF8: $e\n"; }

# this succeeds (failed before with use encoding 'utf8', unknown why)
if ($e=~ m/.*?[x]$/) { print "matched simple\n"; }
print "success with simple\n";

# these die
if ($e=~ m/.*?\p{Space}$/) { print "matched medium\n"; }
print "success with medium\n";

if ($e=~ m/.*?[xyz]$/) { print "matched medium\n"; }
print "success with medium\n";

# the original, full expression. Naturally, this dies.
if ($e =~ m/(.*?)[,\p{isSpace}]+((?:\p{isAlpha}[\p{isSpace}\.]{1,2})+)\p{isSpace}*$/) { print "matched complex\n"; }
print "success with complex\n";
Re^4: Odd problems with UTF-8, regexps, and newer Perl versions by almut (Canon) on Jun 05, 2010 at 03:02 UTC
Re^2: Odd problems with UTF-8, regexps, and newer Perl versions by ikegami (Patriarch) on Jun 05, 2010 at 03:24 UTC
> The code works for me without either the "use utf8"

Except, say, if you took the length of the variable.
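
A small sketch of the length difference, under the assumption that the source bytes are UTF-8. Without `use utf8` Perl sees the literal as five bytes; the byte escapes below simulate that, and Encode's decode gives the four-character string that `use utf8` would have produced.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# What Perl sees for the UTF-8 literal "Böck" without "use utf8": raw bytes.
my $bytes = "B\xc3\xb6ck";

# The same text as a character string.
my $chars = decode('UTF-8', $bytes);

# Count how many times "." matches, i.e. how many units the regex engine sees.
my $byte_units = () = $bytes =~ /./gs;
my $char_units = () = $chars =~ /./gs;

print "bytes: length ", length($bytes), ", regex units $byte_units\n";  # 5, 5
print "chars: length ", length($chars), ", regex units $char_units\n";  # 4, 4
```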
Re: Odd problems with UTF-8, regexps, and newer Perl versions (/i) by tye (Sage) on Jun 05, 2010 at 06:11 UTC
Add /i and I get a more verbose (and non-fatal) error only if "use encoding" is commented out:

`Malformed UTF-8 character (unexpected continuation byte 0xb6, with no preceding start byte) in pattern match (m//)`

Which indicates that the regex engine is starting a step at the second byte of the multi-byte character. And the way the error comes and goes with seemingly unrelated changes makes me suspect there might be something like alignment or buffer overflow involved. (Updated.)

- tye
Re: Odd problems with UTF-8, regexps, and newer Perl versions by westrock2000 (Beadle) on Jun 05, 2010 at 11:38 UTC
I don't know if this helps you, but here is how I got UTF-8 to work across systems using both 5.8 and 5.6.1 (and had to use uxterm on the 5.6.1 system to get xterm to display it correctly):

#!/usr/bin/perl -w

# Perl older than 5.8: load the utf8 pragma at compile time
BEGIN {
    if ($] < 5.008) {
        require utf8;
        utf8->import();
    }
}

# Perl 5.8 or newer: set a UTF-8 output layer on STDOUT instead
if ($] >= 5.008) {
    binmode STDOUT, ':utf8';
}

Basically, if the perl version ($]) is below 5.8 it uses one method of setting UTF-8, and if it's equal to 5.8 or above it sets UTF-8 another way. I don't know how proper this is, but I was getting all kinds of trash whenever I tried to display Unicode (ISO-10646) on older Red Hat 7.3 in xterm. After about 2 days of surfing I came up with the combination of using that in the script and launching in uxterm... now Unicode displays properly on both types of systems.