Regular Expressions, ignore case and unicode

OlegG has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.
I have a strange problem with my regexp. Here is code that demonstrates it (б is the second letter of the Russian alphabet, Б is the same letter in uppercase, а is the first letter of the alphabet, and я is the last):

use utf8;
print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
__END__
regexp ok

Ok, expected result

use utf8;
print 'Бб' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
__END__
regexp fail

Not OK: a very unexpected failure. Only the order of the letters in the tested string changed. Why does it fail?

use utf8;
print 'Бб' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
__END__
regexp ok

OK, the \x{...} notation avoids this strange problem.
So, WTF? How can I avoid the problem without converting the letters in the regexp to \x{...} notation?
Thanks ;)

__DATA__
$uname
Linux
$perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
$locale

LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=

Re: Regular Expressions, ignore case and unicode by moritz (Cardinal) on Apr 21, 2011 at 17:06 UTC
Use a new perl - I've tested it with perl-5.14.0-RC1, and it all works there.

Perl 6 - second systems done right
Re^2: Regular Expressions, ignore case and unicode by OlegG (Monk) on Apr 21, 2011 at 17:44 UTC
Hmm, do you think this is a perl bug? I'll try testing a newer perl version.
Re^3: Regular Expressions, ignore case and unicode by moritz (Cardinal) on Apr 21, 2011 at 18:22 UTC
"Hmm, Do You think this is a perl bug?"

Yes. And fixed somewhere between 5.12.2 and 5.14.0-RC1.

Curious fact:

binmode STDOUT, ':encoding(UTF-8)';
print 'Бб' =~ /^([а-я]+)/i ? "regexp ok '$1'" : 'regexp fail', "\n";
# PS: perlmonks doesn't really do Unicode, it's your second example
# but without the $

Prints "regexp ok 'Бб'" - so it matched the whole string, and only the $ failed.
Re^4: Regular Expressions, ignore case and unicode by tchrist (Pilgrim) on Apr 21, 2011 at 19:26 UTC
Re^4: Regular Expressions, ignore case and unicode by OlegG (Monk) on Apr 21, 2011 at 18:51 UTC
Re^5: Regular Expressions, ignore case and unicode by moritz (Cardinal) on Apr 21, 2011 at 21:30 UTC
Re^3: Regular Expressions, ignore case and unicode by tchrist (Pilgrim) on Apr 21, 2011 at 18:40 UTC
"Hmm, Do You think this is a perl bug? I'll try to test new perl version."

Yes, it was a Perl bug. There were issues with how case-insensitive matching worked within bracketed character class ranges. Here is a demo showing that it used not to work and now does.

% head /tmp/t?
==> /tmp/t1 <==
use utf8;
print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";

==> /tmp/t2 <==
use utf8;
print 'Бб' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";

==> /tmp/t3 <==
use utf8;
print 'Бб' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

% head /tmp/t? | uniquote
==> /tmp/t1 <==
use utf8;
print '\N{U+431}\N{U+411}' =~ /^[\N{U+430}-\N{U+44F}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

==> /tmp/t2 <==
use utf8;
print '\N{U+411}\N{U+431}' =~ /^[\N{U+430}-\N{U+44F}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

==> /tmp/t3 <==
use utf8;
print '\N{U+411}\N{U+431}' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

% head /tmp/t? | uniquote -v
==> /tmp/t1 <==
use utf8;
print '\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC CAPITAL LETTER BE}' =~ /^[\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

==> /tmp/t2 <==
use utf8;
print '\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER BE}' =~ /^[\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

==> /tmp/t3 <==
use utf8;
print '\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER BE}' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";

% apply perl5.12.3 /tmp/t[123]
regexp ok
regexp fail
regexp ok

% apply perl5.14.0-RC1 /tmp/t[123]
regexp ok
regexp ok
regexp ok

The problem evaporates upon upgrading.

You can get the uniquote script demo'd above, plus several other (mostly) Unicode-oriented tools, from training.perl.com/scripts. Most are in varying states of pre-release-ness, but all of them do get used nearly daily.
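
For anyone who cannot upgrade, one possible way to sidestep the problem without \x{...} notation is to keep /i away from the range altogether: either fold the case of the string yourself with lc, or list both cases in the character class. The sketch below is only that, a sketch: the helper names is_cyrillic_word and is_cyrillic_word_explicit are made up for illustration, and it rests on the assumption (not verified on 5.10.1 or 5.12.x) that the bug only affects ranges matched under /i.

```perl
use strict;
use warnings;
use utf8;    # assumes this source file is saved as UTF-8

# Hypothetical helper (not from the thread): true if $str consists only of
# letters in the range а-я, in either case.
# Workaround 1: fold the case ourselves, then match WITHOUT /i.
sub is_cyrillic_word {
    my ($str) = @_;
    return lc($str) =~ /^[а-я]+$/ ? 1 : 0;
}

# Workaround 2: spell out both cases in the class, again WITHOUT /i.
sub is_cyrillic_word_explicit {
    my ($str) = @_;
    return $str =~ /^[а-яА-Я]+$/ ? 1 : 0;
}

# The two strings from the original question.
for my $word ('бБ', 'Бб') {
    print is_cyrillic_word($word)          ? 'regexp ok' : 'regexp fail', "\n";
    print is_cyrillic_word_explicit($word) ? 'regexp ok' : 'regexp fail', "\n";
}
```

On a perl where the bug is fixed, all four lines should read "regexp ok"; whether the same holds on an affected perl is exactly the assumption stated above, so it is worth running this on the perl in question before relying on it.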
Re: Regular Expressions, ignore case and unicode by Anonymous Monk on Apr 21, 2011 at 18:16 UTC
Bug appears to be present in perl v5.12.2
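
To check whether some other perl is affected, the test cases from this thread can be folded into one small script that also reports the interpreter version. This is a rough sketch; the sub name check_case_insensitive_range and the result keys are invented here, and the third check simply reuses the "drop the $ and capture" idea from the reply above.

```perl
use strict;
use warnings;
use utf8;    # assumes this source file is saved as UTF-8

# Hypothetical self-check assembled from the test cases in this thread.
# Returns a hash ref mapping a label to 1 (matched) or 0 (did not match).
sub check_case_insensitive_range {
    my %result;

    # Lowercase first, then uppercase: reported to work everywhere.
    $result{lower_upper} = 'бБ' =~ /^[а-я]+$/i ? 1 : 0;

    # Uppercase first: the case reported to fail on 5.10.1 and 5.12.x.
    $result{upper_lower} = 'Бб' =~ /^[а-я]+$/i ? 1 : 0;

    # Same string without the trailing $, checking that both letters
    # were captured.
    $result{no_anchor} = ( 'Бб' =~ /^([а-я]+)/i && length($1) == 2 ) ? 1 : 0;

    return \%result;
}

my $r = check_case_insensitive_range();
printf "perl %vd\n", $^V;
for my $name ( sort keys %$r ) {
    printf "%-12s %s\n", $name, $r->{$name} ? 'regexp ok' : 'regexp fail';
}
```

Going by the reports in this thread, an affected perl would be expected to print "regexp fail" for upper_lower only, and a fixed one "regexp ok" for all three.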