RegExp breaks in Perl 5.10

jfraire has asked for the wisdom of the Perl Monks concerning the following question:

Re: RegExp breaks in Perl 5.10 by almut (Canon) on Mar 06, 2008 at 19:42 UTC
Adding `use encoding "iso-8859-1";` (to explicitly tell Perl that your source is in iso-8859-1) at the top of your module did fix it for me:

    $ make test
    PERL_DL_NONLAZY=1 /usr/local/perl/5.10.0/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/Lingua-Stem-Es....ok
    All tests successful.
    Files=1, Tests=28379, 11 wallclock secs ( 8.21 cusr + 0.30 csys = 8.51 CPU)
Re^2: RegExp breaks in Perl 5.10 by eserte (Deacon) on Mar 06, 2008 at 19:44 UTC
But note that perl 5.10.0 is still broken...
Re^3: RegExp breaks in Perl 5.10 by almut (Canon) on Mar 06, 2008 at 20:05 UTC
...seems like it, yes. At least the test code from the thread you linked to still produces lots of "not ok"s with 5.10.0. But I'll leave wiser heads than mine to comment on that... :)
Re: RegExp breaks in Perl 5.10 by grinder (Bishop) on Mar 06, 2008 at 20:24 UTC
Hmm, I was sufficiently surprised by this behaviour (which I've not heard of before) that I went looking. First off, your code fragment is not much use, as it does not define what $R2 contains. So I went and looked at the source, and ripped the following out of its guts:

    use strict;
    use warnings;

    my @word = qw(
        constituci\xf3n contribuci\xf3n destituci\xf3n devoluci\xf3n disminuci\xf3n
        constituciones contribuciones destituciones devoluciones disminuciones
        foo
    );

    my $vowels     = 'aeiou\xe1\xe9\xed\xf3\xfa\xfc';
    my $consonants = 'bcdfghjklmn\xf1pqrstvwxyz';
    my $revowel      = qr/[$vowels]/;
    my $reconsonants = qr/[$consonants]/;

    my $R2;
    my $suffix;

    for my $word (@word) {
        ($R2) = $word =~ /^.?$revowel$reconsonants.?$revowel$reconsonants(.*)$/;
        $R2 ||= '';

        if ( ($suffix) = $R2 =~ /(uciones|uci\xf3n)$/ ) {
            # uci\xf3n uciones
            # replace with u if in R2
            $word =~ s/$suffix$/u/;
            print "Step 1 case 4: $word\n";
        }
    }

(Those `\xnn` sequences really are Latin-1 characters; the escapes are just an artifact of a direct cut'n'paste from my shell.)

And that runs just fine here, all the way up to "perl, v5.11.0 DEVEL33323 built for i386-freebsd-64int". So there's something else going on. Both "ución" and "uciones" match just fine. Perhaps the tester platforms are running in a different locale.

To play it safe, I suggest you encode your program in UTF-8 and slap a `use utf8` at the top and be done with it. At least I think that's the correct best practice. Thinking about encoding makes my head explode.

• another intruder with the mooring in the heart of the Perl
Re^2: RegExp breaks in Perl 5.10 by almut (Canon) on Mar 06, 2008 at 21:13 UTC
I think the issue with the module's original code is that one side of the match has been decoded from UTF-8 (the word list from the file) while the other is in Latin-1 (the literal strings in the source). In your test case, both are in Latin-1, so they match. When adding (at the beginning of the loop)

    $word = Encode::decode("iso-8859-1", $word); # force utf8 flag on
    print "$word:\n";

I can reproduce the problem, i.e. when forcing utf8, I get

    constitución:
    contribución:
    destitución:
    devolución:
    disminución:
    constituciones:
    Step 1 case 4: constitu
    contribuciones:
    Step 1 case 4: contribu
    destituciones:
    Step 1 case 4: destitu
    devoluciones:
    Step 1 case 4: devolu
    disminuciones:
    Step 1 case 4: disminu
    foo:

while with your original test, the output is

    constitución:
    Step 1 case 4: constitu
    contribución:
    Step 1 case 4: contribu
    destitución:
    Step 1 case 4: destitu
    devolución:
    Step 1 case 4: devolu
    disminución:
    Step 1 case 4: disminu
    constituciones:
    Step 1 case 4: constitu
    contribuciones:
    Step 1 case 4: contribu
    destituciones:
    Step 1 case 4: destitu
    devoluciones:
    Step 1 case 4: devolu
    disminuciones:
    Step 1 case 4: disminu
    foo:
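To see the two representations described above side by side, here is a minimal sketch. The sample word and the helper `flag_state` are made up for illustration; `utf8::is_utf8` and `Encode::decode` are the standard functions. It only reports how each string is stored internally and whether Perl still treats them as the same string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample word; \xf3 is "ó" in ISO-8859-1.
my $latin1  = "constituci\xf3n";
my $decoded = Encode::decode('iso-8859-1', $latin1);

# Hypothetical helper: label the internal representation of a string.
sub flag_state { utf8::is_utf8($_[0]) ? 'utf8-flagged' : 'bytes' }

printf "%-8s %s\n", 'literal:', flag_state($latin1);
printf "%-8s %s\n", 'decoded:', flag_state($decoded);
printf "%-8s %s\n", 'equal:',   ($latin1 eq $decoded ? 'yes' : 'no');
```

The two strings compare equal with `eq` even though only one carries the utf8 flag, so a regex that succeeds on one and fails on the other is reacting to the internal representation rather than to the characters.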
Re^2: RegExp breaks in Perl 5.10 by eserte (Deacon) on Mar 06, 2008 at 20:52 UTC
If there's no "use locale" in the script then it should not be locale-dependent.
Re: RegExp breaks in Perl 5.10 by eserte (Deacon) on Mar 06, 2008 at 19:38 UTC
I don't know if it's the same bug, but maybe it's in the same class: Another regexp failure with utf8-flagged string and byte-flagged pattern. I would propose that you force utf8 context on both sides, regexp pattern and variable. Maybe the errors will go away.
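One possible reading of "force utf8 context on both sides", as a rough sketch: upgrade both the pattern source and the subject with `utf8::upgrade` before matching. The helper name `upgraded` and the sample word are assumptions for illustration, not code from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return an upgraded copy, leaving the argument untouched.
sub upgraded {
    my ($str) = @_;
    utf8::upgrade($str);    # same characters, utf8 internal representation
    return $str;
}

my $word   = upgraded("disminuci\xf3n");      # subject
my $alt    = upgraded("uciones|uci\xf3n");    # pattern source
my $suffix = qr/($alt)$/;

my ($found) = $word =~ $suffix;
print defined $found
    ? "matched " . length($found) . " characters\n"
    : "no match\n";
```

Whether this sidesteps the 5.10.0 problem in every case is not something the sketch can promise; it only makes sure neither side of the match is a byte string.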
Re: RegExp breaks in Perl 5.10 by jfraire (Beadle) on Mar 06, 2008 at 22:54 UTC
It strikes me as odd that of 37 CPAN Testers' reports, 7 passed OK and they are all 5.8.8. Those failed reports for 5.8.8 are because I did not declare Test::Exception as a pre-requisite.

Moreover, the module is heavy on regexes. It is also really odd that this is the only regex affected, and only one half of it! Out of more than 28K words, only 15 fail... I wouldn't think this is related to encoding.

Since almut has fixed the module by declaring the encoding, I think that grinder's advice of moving to UTF-8 and declaring so is the most general solution. I have done it in my local copy of the module and the test suite still passes OK under my Perl 5.8.8 on Linux.

Update: 7 passes and 28 fails.
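For illustration, moving to UTF-8 and declaring it might look roughly like the following. This is a sketch under assumptions: the source file is saved as UTF-8, the input arrives as raw UTF-8 octets, and the variable names are invented here rather than taken from the module.

```perl
use strict;
use warnings;
use utf8;                      # this source file is saved as UTF-8
use Encode ();

# Literal non-ASCII characters in the pattern are now read as characters.
my $suffix_re = qr/(uciones|ución)$/;

# Hypothetical input: raw UTF-8 octets, as read from a file opened
# without an encoding layer.
my $octets = "devoluci\xc3\xb3n";
my $word   = Encode::decode('UTF-8', $octets);

# Copy the word, then replace the suffix with "u" in the copy.
(my $stem = $word) =~ s/$suffix_re/u/;

binmode STDOUT, ':encoding(UTF-8)';
print "$word -> $stem\n";
```

The point of the design is that both the pattern and the data are character strings by the time they meet, so neither relies on an implicit Latin-1 interpretation.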
Re^2: RegExp breaks in Perl 5.10 by almut (Canon) on Mar 07, 2008 at 02:29 UTC
"Out of more than 28K words, only 15 fail... I wouldn't think this is related to encoding."

Yes, this is definitely odd, in particular as the ó character, which was causing problems in your case, was not one of the problem chars in Slaven Rezic's test code (which I'm re-posting here for easy reference):

    for my $chr (160 .. 255) {
        my $chr_byte = chr($chr);
        my $chr_utf8 = chr($chr);
        utf8::upgrade($chr_utf8);
        my $rx = qr{$chr_byte|X}i;
        print $chr . " " . ($chr_utf8 =~ $rx ? "ok" : "not ok") . "\n";
    }

Here, it was mainly uppercase letters where the match failed. Note that the matching is done case-insensitively (which you don't do in your module). However, when you remove the `'i'` from the `qr{}`, everything seems to work fine...

So, I played around with this a bit more, and in fact it turned out the bug is highly dependent on context (which could explain why most of your regexes kept working). For example, this modified test code still works fine

    for my $chr (160 .. 255) {
        my $chr_byte = chr($chr);
        my $chr_utf8 = chr($chr);
        utf8::upgrade($chr_utf8);
        my $rx = qr{uci$chr_byte|uci};
        my $s = "uci$chr_utf8";
        print $chr . " " . ($s =~ $rx ? "ok" : "not ok") . "\n";
    }

but if you add another character to the second alternative in the regex, e.g.

    ...
    my $rx = qr{uci$chr_byte|uci_};
    ...

(the underscore shown here can be any char, it seems) the match suddenly fails in all cases tested (160..255), but only if the leading 3 chars of the alternative are `"uci"`. Actually, there are a number of other weird cases, but I think I don't have to show them all here. :)

As already mentioned in that thread, the problem seems to be related to the new trie code, because if you set `${^RE_TRIE_MAXBUF} = -1;` all weirdness disappears.
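A sketch of applying the `${^RE_TRIE_MAXBUF}` setting only around the compilation of one pattern, instead of globally. The helper `compile_without_trie` is hypothetical, and the sketch assumes the variable is consulted when the pattern is compiled, so it has to be set before `qr//` runs. The loop reuses the failing alternation from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a pattern with the trie optimisation
# switched off, then put the previous setting back.
sub compile_without_trie {
    my ($source) = @_;
    my $saved = ${^RE_TRIE_MAXBUF};
    ${^RE_TRIE_MAXBUF} = -1;        # negative value disables the optimisation
    my $rx = qr/$source/;           # compiled while the setting is in effect
    ${^RE_TRIE_MAXBUF} = $saved;
    return $rx;
}

my $failures = 0;
for my $chr (160 .. 255) {
    my $chr_byte = chr($chr);
    my $chr_utf8 = chr($chr);
    utf8::upgrade($chr_utf8);
    my $rx = compile_without_trie("uci$chr_byte|uci_");
    $failures++ unless "uci$chr_utf8" =~ $rx;
}
print "failures: $failures\n";
```

Scoping the change to a helper keeps the optimisation available for every other regex in the program, at the cost of having to route the affected patterns through it.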
Re^3: RegExp breaks in Perl 5.10 by jfraire (Beadle) on Mar 07, 2008 at 07:06 UTC
I followed this thread's advice (as I said above) and uploaded the module converted into UTF-8. The good news is that so far, all reports in both 5.10.0 and 5.8.8 have passed.

The bad news is that if you repeat your test but backwards, $latin =~ /utf8/, it also fails:

    for my $chr (160 .. 255) {
        my $chr_byte = chr($chr);
        my $chr_utf8 = chr($chr);
        utf8::upgrade($chr_utf8);
        my $rx = qr{uci$chr_utf8|uci_};
        my $s = "uci$chr_byte";
        print $chr . " " . ($s =~ $rx ? "ok" : "not ok") . "\n";
    }

Now that the module is UTF-8, I copied the test suite's list of words into Latin-1. As suggested by the test above, the new test suite fails. It fails for the same 15 words.

So, is `${^RE_TRIE_MAXBUF} = -1;` the most general work-around? What implications does it have? What other options do I have?

Thank you for your kind help.
Re^4: RegExp breaks in Perl 5.10 by almut (Canon) on Mar 07, 2008 at 18:21 UTC
Re^5: RegExp breaks in Perl 5.10 by jfraire (Beadle) on Mar 07, 2008 at 20:02 UTC