OlegG has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.
I have a strange problem with my regexp. Here is some code that demonstrates it (б is the second letter of the Russian alphabet, Б is the same letter in uppercase, а is the first letter of the alphabet, and я is the last):

use utf8;
print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
__END__
regexp ok

OK, the expected result.

use utf8;
print 'Бб' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
__END__
regexp fail

Not OK: a very unexpected failure. Only the order of the letters in the tested string was changed. Why does it fail?

use utf8;
print 'Бб' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
__END__
regexp ok

OK, the \x{...} notation avoids this strange problem.
So, WTF? How can I avoid the problem without converting the letters in the regexp to \x{...} notation?
Thanks ;)

__DATA__
$uname
Linux
$perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
$locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=

Replies are listed 'Best First'.
Re: Regular Expressions, ignore case and unicode
by moritz (Cardinal) on Apr 21, 2011 at 17:06 UTC
      Hmm, do you think this is a perl bug?
      I'll try testing a newer perl version.
        Hmm, do you think this is a perl bug?

        Yes, and it was fixed somewhere between 5.12.2 and 5.14.0-RC1. A curious fact:

        use utf8;
        binmode STDOUT, ':encoding(UTF-8)';
        print 'Бб' =~ /^([а-я]+)/i ? "regexp ok '$1'" : 'regexp fail', "\n";
        # PS: perlmonks doesn't really do Unicode; this is your second example
        # but without the $
        

        This prints "regexp ok 'Бб'", so the class matched the whole string and only the $ anchor failed.

        Hmm, do you think this is a perl bug?

        I'll try testing a newer perl version.

        Yes, it was a Perl bug. There were issues with how case-insensitive matching worked within bracketed character class ranges. Here is a demo showing that it used to fail and now works.
        % head /tmp/t?
        ==> /tmp/t1 <==
        use utf8;
        print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        ==> /tmp/t2 <==
        use utf8;
        print 'Бб' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        ==> /tmp/t3 <==
        use utf8;
        print 'Бб' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        % head /tmp/t? | uniquote
        ==> /tmp/t1 <==
        use utf8;
        print '\N{U+431}\N{U+411}' =~ /^[\N{U+430}-\N{U+44F}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        ==> /tmp/t2 <==
        use utf8;
        print '\N{U+411}\N{U+431}' =~ /^[\N{U+430}-\N{U+44F}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        ==> /tmp/t3 <==
        use utf8;
        print '\N{U+411}\N{U+431}' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        % head /tmp/t? | uniquote -v
        ==> /tmp/t1 <==
        use utf8;
        print '\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC CAPITAL LETTER BE}' =~ /^[\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        ==> /tmp/t2 <==
        use utf8;
        print '\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER BE}' =~ /^[\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        ==> /tmp/t3 <==
        use utf8;
        print '\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER BE}' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
        
        % apply perl5.12.3 /tmp/t[123]
        regexp ok
        regexp fail
        regexp ok
        
        % apply perl5.14.0-RC1 /tmp/t[123]
        regexp ok
        regexp ok
        regexp ok
        
        The problem evaporates upon upgrading.
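
        If upgrading is not an option, one possible workaround is to drop /i and spell out both cases in the class. This is only a minimal sketch and has not been run on the affected 5.10/5.12 builds. The helper name is_cyrillic_word is made up for illustration, and the class covers only а-я and А-Я, so ё/Ё stay excluded just as they are in the original pattern.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal Cyrillic characters

# Hypothetical helper (not from the thread): returns 1 if $str consists
# only of the letters а-я or А-Я, and 0 otherwise. No /i is used, so the
# case-insensitive range handling discussed above is never involved.
sub is_cyrillic_word {
    my ($str) = @_;
    return $str =~ /^[а-яА-Я]+$/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $s ('бБ', 'Бб', 'abc') {
    print "$s: ", (is_cyrillic_word($s) ? 'regexp ok' : 'regexp fail'), "\n";
}
```

        Both orderings of the two letters go through the same explicit class, so the result no longer depends on which case comes first.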

        You can get the uniquote script demo’d above, plus several other (mostly) Unicode-oriented tools, from training.perl.com/scripts. Most are in varying states of pre-release-ness, but all of them do get used nearly daily.

Re: Regular Expressions, ignore case and unicode
by Anonymous Monk on Apr 21, 2011 at 18:16 UTC
    Bug appears to be present in perl v5.12.2
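
    For code that has to run on both old and new perls, here is a rough sketch that picks the pattern from the interpreter version. The 5.014 cutoff is an assumption based only on the versions reported in this thread (failing on 5.12.x, working on 5.14.0-RC1), and cyrillic_pattern is a hypothetical name.

```perl
use strict;
use warnings;
use utf8;    # the patterns below contain literal Cyrillic characters

# Hypothetical helper: choose a pattern for the given perl version
# (defaults to the running interpreter's $]). The 5.014 threshold is an
# assumption taken from the versions mentioned in this thread.
sub cyrillic_pattern {
    my ($version) = @_;
    $version = $] unless defined $version;
    return $version >= 5.014
        ? qr/^[а-я]+$/i        # /i on the range is reported to work here
        : qr/^[а-яА-Я]+$/;     # older perls: list both cases, no /i
}

my $re = cyrillic_pattern();
print "\x{411}\x{431}" =~ $re ? 'regexp ok' : 'regexp fail', "\n";
```

    Passing the version as an argument keeps both branches easy to exercise from a test without needing two perl installations.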