in reply to Re^2: Regular Expressions, ignore case and unicode
in thread Regular Expressions, ignore case and unicode
> Hmm, Do You think this is a perl bug?

Yes, it was a Perl bug. Case-insensitive matching did not work correctly within bracketed character class ranges. Here is a demo showing that it used to fail and now works.

> I'll try to test new perl version.

% head /tmp/t?
==> /tmp/t1 <==
use utf8;
print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
==> /tmp/t2 <==
use utf8;
print 'Бб' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";
==> /tmp/t3 <==
use utf8;
print 'Бб' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
% head /tmp/t? | uniquote
==> /tmp/t1 <==
use utf8;
print '\N{U+431}\N{U+411}' =~ /^[\N{U+430}-\N{U+44F}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
==> /tmp/t2 <==
use utf8;
print '\N{U+411}\N{U+431}' =~ /^[\N{U+430}-\N{U+44F}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
==> /tmp/t3 <==
use utf8;
print '\N{U+411}\N{U+431}' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
% head /tmp/t? | uniquote -v
==> /tmp/t1 <==
use utf8;
print '\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC CAPITAL LETTER BE}' =~ /^[\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
==> /tmp/t2 <==
use utf8;
print '\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER BE}' =~ /^[\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
==> /tmp/t3 <==
use utf8;
print '\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER BE}' =~ /^[\x{430}-\x{44f}]+$/i ? 'regexp ok' : 'regexp fail', "\n";
% apply perl5.12.3 /tmp/t[123]
regexp ok
regexp fail
regexp ok
% apply perl5.14.0-RC1 /tmp/t[123]
regexp ok
regexp ok
regexp ok
The problem evaporates upon upgrading.
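
If upgrading is not an option, one possible workaround is to avoid /i inside the bracketed range altogether: fold the string to lowercase with lc and match against a lowercase-only range. The sketch below is an assumption on my part, not something from the runs above, and I have not tried it on 5.12.3; the helper name is_basic_cyrillic is made up for illustration. Note that the range U+0430 to U+044F, the same one used in the demo, does not include ё (U+0451).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original demo): true if the string
# consists only of the letters U+0430..U+044F, ignoring case, without
# relying on /i inside a bracketed character class range.
sub is_basic_cyrillic {
    my ($str) = @_;
    # Fold to lowercase first, then match a lowercase-only range.
    return lc($str) =~ /^[\x{430}-\x{44f}]+$/ ? 1 : 0;
}

# Same two strings as /tmp/t1 and /tmp/t2, plus an ASCII control case.
for my $s ("\x{431}\x{411}", "\x{411}\x{431}", "abc") {
    my $codepoints = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
    printf "%s: %s\n", $codepoints,
        is_basic_cyrillic($s) ? 'regexp ok' : 'regexp fail';
}
```

Because the strings contain code points above 255, lc applies Unicode case rules to them, so both letter orders should be treated the same way.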
You can get the uniquote script demo’d above, plus several other (mostly) Unicode-oriented tools, from training.perl.com/scripts. Most are in varying states of pre-release-ness, but all of them do get used nearly daily.