Re: RegExp breaks in Perl 5.10

Hmm, I was sufficiently surprised by this behaviour (that I've not heard of before) that I went looking. First off, your code fragment is not much use, as it does not define what $R2 contains. So I went and looked at the source, and ripped the following out of its guts:

use strict;
use warnings;

my @word = qw(
    constituci\xf3n contribuci\xf3n destituci\xf3n devoluci\xf3n disminuci\xf3n
    constituciones contribuciones destituciones devoluciones disminuciones
    foo
);

my $vowels     = 'aeiou\xe1\xe9\xed\xf3\xfa\xfc';
my $consonants = 'bcdfghjklmn\xf1pqrstvwxyz';

my $revowel      = qr/[$vowels]/;
my $reconsonants = qr/[$consonants]/;
my $R2;
my $suffix;

for my $word (@word) {
    ($R2) = $word =~ /^.*?$revowel$reconsonants.*?$revowel$reconsonants(.*)$/;
    $R2 ||= '';
    if ( ($suffix) = $R2 =~ /(uciones|uci\xf3n)$/ ) {
        # uci\xf3n uciones
        # replace with u if in R2
        $word =~ s/$suffix$/u/;
        print "Step 1 case 4: $word\n";
    }
}

(Those \xnn sequences really are Latin-1 characters in the original; the escapes are just an artifact of a direct cut'n'paste from my shell.)

And that runs just fine here, all the way up to "perl, v5.11.0 DEVEL33323 built for i386-freebsd-64int". So there's something else going on. Both "ución" and "uciones" match just fine. Perhaps the tester platforms are running in a different locale. To play it safe, I suggest you encode your program in UTF-8 and slap a use utf8 at the top and be done with it. At least I think that's the correct best practice. Thinking about encoding makes my head explode.

• another intruder with the mooring in the heart of the Perl

Re^2: RegExp breaks in Perl 5.10 by almut (Canon) on Mar 06, 2008 at 21:13 UTC
I think the issue with the module's original code is that the one side of the match has been decoded from UTF-8 (the word list from the file) while the other is in Latin1 (the literal strings in the source). In your test case, both are in Latin1, so they match. When adding (at the beginning of the loop)

    $word = Encode::decode("iso-8859-1", $word);   # force utf8 flag on
    print "$word:\n";

I can reproduce the problem, i.e. when forcing utf8, I get

    constitución:
    contribución:
    destitución:
    devolución:
    disminución:
    constituciones:
    Step 1 case 4: constitu
    contribuciones:
    Step 1 case 4: contribu
    destituciones:
    Step 1 case 4: destitu
    devoluciones:
    Step 1 case 4: devolu
    disminuciones:
    Step 1 case 4: disminu
    foo:

while with your original test, the output is

    constitución:
    Step 1 case 4: constitu
    contribución:
    Step 1 case 4: contribu
    destitución:
    Step 1 case 4: destitu
    devolución:
    Step 1 case 4: devolu
    disminución:
    Step 1 case 4: disminu
    constituciones:
    Step 1 case 4: constitu
    contribuciones:
    Step 1 case 4: contribu
    destituciones:
    Step 1 case 4: destitu
    devoluciones:
    Step 1 case 4: devolu
    disminuciones:
    Step 1 case 4: disminu
    foo:
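
For what it's worth, here is a minimal sketch of the general principle (both sides of a match have to be in the same representation). It does not reproduce the exact 5.10 behaviour above. It shows the related and fully deterministic case where undecoded UTF-8 bytes are matched against a pattern written in terms of characters. The helper name strip_suffix is made up for the illustration, it skips the R2 check from the real code, and the assumption that the input arrives as UTF-8 bytes is mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the module): replace a trailing
# "uciones" or "ución" with "u". The o-acute is written as \x{f3},
# i.e. as a character (code point U+00F3), to keep this source ASCII-only.
sub strip_suffix {
    my ($word) = @_;
    (my $stem = $word) =~ s/(?:uciones|uci\x{f3}n)$/u/;
    return $stem;
}

# Assumption: the word arrives as raw UTF-8 bytes, e.g. read from a file
# without an :encoding layer. The o-acute is then the two bytes C3 B3.
my $bytes = "constituci\xc3\xb3n";

# Undecoded, the string contains no character F3, so the suffix is not found
# and the word comes back unchanged.
print "raw:     ", strip_suffix($bytes), "\n";

# Decoded, C3 B3 becomes the single character U+00F3 and the suffix matches.
my $chars = decode('UTF-8', $bytes);
print "decoded: ", strip_suffix($chars), "\n";   # decoded: constitu
```

The point of the sketch is only that decoding at the input boundary, and writing the patterns as characters (via use utf8 or \x{...} escapes), puts both sides on the same footing.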
Re^2: RegExp breaks in Perl 5.10 by eserte (Deacon) on Mar 06, 2008 at 20:52 UTC
If there's no "use locale" in the script then it should not be locale-dependent.