jfraire has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have (what I thought was) a really simple RegExp working under Perl 5.8.8 but it breaks when tested under 5.10.0. It is part of Lingua::Stem::Es and it is guilty of a lot of the failures reported for the current version in CPAN.

The offending code is:

    if ( ($suffix) = $R2 =~ /(uciones|ución)$/ ) {
        # ución uciones
        # replace with u if in R2
        $word =~ s/$suffix$/u/;
        print "Step 1 case 4: $word\n" if $DEBUG;
    }

I expect it to match when $R2 ends in either "uciones" or "ución", but it fails to match when $R2='ución'. There are 15 such failures in the test suite, related to these words:

and ten other words, all ending in "ución".

When $R2 contains "uciones" the RegExp works OK; there are 10 such examples in the test suite.

I would appreciate it if someone could offer some insight into why this is happening. If you'd like to try the module, there is an undocumented $DEBUG global var that, if set, will display the different steps where the word is being stemmed.

(The other reason why some tests failed is because I forgot to declare Test::Exception as a requirement).

Thanks in advance,

Julio

Re: RegExp breaks in Perl 5.10
by almut (Canon) on Mar 06, 2008 at 19:42 UTC

    Adding use encoding "iso-8859-1"; (to explicitly tell Perl that your source is in iso-8859-1) at the top of your module did fix it for me:

    $ make test
    PERL_DL_NONLAZY=1 /usr/local/perl/5.10.0/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/Lingua-Stem-Es....ok
    All tests successful.
    Files=1, Tests=28379, 11 wallclock secs ( 8.21 cusr + 0.30 csys = 8.51 CPU)
      But note that perl 5.10.0 is still broken...

        ...seems like it, yes. At least the test code from the thread you linked to still produces lots of "not ok"s with 5.10.0.  But I'll leave wiser heads than mine to comment on that... :)

Re: RegExp breaks in Perl 5.10
by grinder (Bishop) on Mar 06, 2008 at 20:24 UTC

    Hmm, I was sufficiently surprised by this behaviour (that I've not heard of before) that I went looking. First off, your code fragment is not much use, as it does not define what $R2 contains. So I went and looked at the source, and ripped the following out of its guts:

    use strict;
    use warnings;

    my @word = qw(
        constituci\xf3n contribuci\xf3n destituci\xf3n devoluci\xf3n disminuci\xf3n
        constituciones contribuciones destituciones devoluciones disminuciones
        foo
    );

    my $vowels       = 'aeiou\xe1\xe9\xed\xf3\xfa\xfc';
    my $consonants   = 'bcdfghjklmn\xf1pqrstvwxyz';
    my $revowel      = qr/[$vowels]/;
    my $reconsonants = qr/[$consonants]/;

    my $R2;
    my $suffix;

    for my $word (@word) {
        ($R2) = $word =~ /^.*?$revowel$reconsonants.*?$revowel$reconsonants(.*)$/;
        $R2 ||= '';
        if ( ($suffix) = $R2 =~ /(uciones|uci\xf3n)$/ ) {
            # uci\xf3n uciones
            # replace with u if in R2
            $word =~ s/$suffix$/u/;
            print "Step 1 case 4: $word\n";
        }
    }

    (Those \xnn characters really are Latin-1 characters, that's just a direct cut'n'paste from my shell introducing the artifact).

    And that runs just fine here, all the way up to "perl, v5.11.0 DEVEL33323 built for i386-freebsd-64int". So there's something else going on. Both "ución" and "uciones" match just fine. Perhaps the tester platforms are running in a different locale. To play it safe, I suggest you encode your program in UTF-8 and slap a use utf8 at the top and be done with it. At least I think that's the correct best practice. Thinking about encoding makes my head explode.
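A minimal sketch of what the "UTF-8 source plus use utf8" suggestion could look like, assuming the file is saved as UTF-8 and the input arrives as UTF-8 octets. It is not the module's actual code: the R2 computation is left out and the names $raw, $word and $stem are illustrative only.

```perl
use strict;
use warnings;
use utf8;          # tells perl that the literals in this file are UTF-8
use Encode ();

# Illustrative input: "constitución" as UTF-8 octets, as it might be read from a file.
my $raw  = "constituci\xc3\xb3n";
my $word = Encode::decode('UTF-8', $raw);   # now a character string

# Both the pattern literal and the subject are character strings here,
# so neither side of the match is left in a different encoding.
my $stem = $word;
if ( my ($suffix) = $word =~ /(uciones|ución)$/ ) {
    $stem =~ s/\Q$suffix\E$/u/;
}

# Encode again before printing so no wide or Latin-1 character leaks out raw.
print Encode::encode('UTF-8', "$word -> $stem\n");
```

The point of the sketch is only that pattern and subject end up in the same internal form; whether that is enough to sidestep the 5.10.0 behaviour discussed in this thread is something the test suite has to confirm.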

    • another intruder with the mooring in the heart of the Perl

      I think the issue with the module's original code is that one side of the match has been decoded from UTF-8 (the word list from the file) while the other is in Latin-1 (the literal strings in the source). In your test case, both are in Latin-1, so they match.
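The difference between the two sides can be made visible with a short sketch. It assumes nothing about the module itself: $bytes and $decoded are illustrative names, and the ó is written as an escape so the example does not depend on how this file is encoded.

```perl
use strict;
use warnings;
use Encode ();

# "ución" with ó given as \xf3: a plain Latin-1 byte string, UTF-8 flag off.
my $bytes = "uci\xf3n";

# The same five characters, but decoded, so perl stores them internally
# as UTF-8 and sets the UTF-8 flag on the scalar.
my $decoded = Encode::decode('iso-8859-1', $bytes);

printf "bytes:   length=%d flag=%s\n", length($bytes),   utf8::is_utf8($bytes)   ? 'on' : 'off';
printf "decoded: length=%d flag=%s\n", length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';
print  "eq says: ", ( $bytes eq $decoded ? "same" : "different" ), "\n";
```

At the language level the two strings are the same sequence of characters and compare equal with eq; only the internal representation differs, which is why a regex that treats them differently is surprising.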

      When adding (at the beginning of the loop)

      $word = Encode::decode("iso-8859-1", $word); # force utf8 flag on
      print "$word:\n";

      I can reproduce the problem, i.e. when forcing utf8, I get

      constitución:
      contribución:
      destitución:
      devolución:
      disminución:
      constituciones:
      Step 1 case 4: constitu
      contribuciones:
      Step 1 case 4: contribu
      destituciones:
      Step 1 case 4: destitu
      devoluciones:
      Step 1 case 4: devolu
      disminuciones:
      Step 1 case 4: disminu
      foo:
      

      while with your original test, the output is

      constitución:
      Step 1 case 4: constitu
      contribución:
      Step 1 case 4: contribu
      destitución:
      Step 1 case 4: destitu
      devolución:
      Step 1 case 4: devolu
      disminución:
      Step 1 case 4: disminu
      constituciones:
      Step 1 case 4: constitu
      contribuciones:
      Step 1 case 4: contribu
      destituciones:
      Step 1 case 4: destitu
      devoluciones:
      Step 1 case 4: devolu
      disminuciones:
      Step 1 case 4: disminu
      foo:
      
      If there's no "use locale" in the script then it should not be locale-dependent.
Re: RegExp breaks in Perl 5.10
by eserte (Deacon) on Mar 06, 2008 at 19:38 UTC
Re: RegExp breaks in Perl 5.10
by jfraire (Beadle) on Mar 06, 2008 at 22:54 UTC

    It strikes me as odd that of 37 CPAN Testers' reports, 7 passed OK and they are all 5.8.8. The failed reports for 5.8.8 are because I did not declare Test::Exception as a prerequisite.

    Moreover, the module is heavy on regexes. It is also really odd that this is the only one affected, and only one half of it! Out of more than 28K words, only 15 fail... I wouldn't think this is related to encoding.

    Since almut has fixed the module by declaring the encoding, I think that grinder's advice of moving to UTF-8 and declaring so is the most general solution. I have done it in my local copy of the module and the test suite still passes OK under my Perl 5.8.8 on Linux.

    Update: 7 passes and 28 fails.

      Out of more than 28K words, only 15 fail... I wouldn't think this is related to encoding.

      Yes, this is definitely odd, in particular as the ó character, which was causing problems in your case, was not one of the problem chars in Slaven Rezic's test code (which I'm re-posting here for easy reference):

      for my $chr (160 .. 255) {
          my $chr_byte = chr($chr);
          my $chr_utf8 = chr($chr);
          utf8::upgrade($chr_utf8);
          my $rx = qr{$chr_byte|X}i;
          print $chr . " " . ($chr_utf8 =~ $rx ? "ok" : "not ok") . "\n";
      }

      Here, it was mainly uppercase letters where the match failed.

      Note that the matching is done case-insensitively (which you don't do in your module).  However, when you remove the 'i' from the qr{}, everything seems to work fine... So, I played around with this a bit more, and in fact it turned out the bug is highly dependent on context (which could explain why most of your regexes kept working).

      For example, this modified test code still works fine

      for my $chr (160 .. 255) {
          my $chr_byte = chr($chr);
          my $chr_utf8 = chr($chr);
          utf8::upgrade($chr_utf8);
          my $rx = qr{uci$chr_byte|uci};
          my $s = "uci$chr_utf8";
          print $chr . " " . ($s =~ $rx ? "ok" : "not ok") . "\n";
      }

      but if you add another character to the second alternative in the regex, e.g.

      ... my $rx = qr{uci$chr_byte|uci_}; ...

      (the underscore shown here can be any char, it seems) the match suddenly fails in all cases tested (160..255) — but only if the leading 3 chars of the alternative are "uci". Actually, there are a number of other weird cases, but I think I don't have to show them all here. :)

      As already mentioned in that thread, the problem seems to be related to the new trie code, because if you set ${^RE_TRIE_MAXBUF} = -1; all weirdness disappears.
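For reference, a sketch of the failing variant with that work-around applied. The only assumption beyond the thread is the @failed array, used here to collect code points instead of printing each one. perlvar documents that a negative ${^RE_TRIE_MAXBUF} prevents the trie optimisation; the variable has to be set before the qr{} is compiled.

```perl
use strict;
use warnings;

# Work-around from the thread: a negative value turns the trie optimisation off.
# This is a global setting, so it affects every regex compiled afterwards.
${^RE_TRIE_MAXBUF} = -1;

my @failed;    # illustrative: code points whose match did not succeed
for my $chr (160 .. 255) {
    my $chr_byte = chr($chr);
    my $chr_utf8 = chr($chr);
    utf8::upgrade($chr_utf8);              # same character, UTF-8 flag on

    my $rx = qr{uci$chr_byte|uci_};        # the alternation that failed under 5.10.0
    my $s  = "uci$chr_utf8";
    push @failed, $chr unless $s =~ $rx;
}

print @failed
    ? "not ok for: @failed\n"
    : "all code points 160..255 matched\n";
```

On a perl without the bug the loop should report no failures with or without the assignment, so the sketch is only informative when run under an affected 5.10.0.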

        I followed this thread's advice (as I said above) and uploaded the module converted to UTF-8. The good news is that so far, all reports for both 5.10.0 and 5.8.8 have passed.

        Bad news is that if you repeat your test but backwards, $latin =~ /utf8/, it also fails:

        for my $chr (160 .. 255) {
            my $chr_byte = chr($chr);
            my $chr_utf8 = chr($chr);
            utf8::upgrade($chr_utf8);
            my $rx = qr{uci$chr_utf8|uci_};
            my $s = "uci$chr_byte";
            print $chr . " " . ($s =~ $rx ? "ok" : "not ok") . "\n";
        }

        Now that the module is UTF-8, I converted the test suite's word list to Latin-1. As the test above suggests, the new test suite fails, and it fails for the same 15 words.

        So, is ${^RE_TRIE_MAXBUF} = -1; the most general work-around? What implications does it have? What other options do I have?

        Thank you for your kind help.