telcontar has asked for the wisdom of the Perl Monks concerning the following question:
The answer to requirement 2), as of 5.6.0, is that if a regexp contains Unicode characters, the string is searched as a sequence of Unicode characters. Otherwise, the string is searched as a sequence of bytes.

However, in Unicode, some characters may be represented in more than one way, e.g. as a single precomposed code point, or as a base character followed by a combining character.
    #!/usr/bin/perl
    my @t = (
        "A\x{300}B",        # U+0300 COMBINING GRAVE ACCENT
        "AA",
        "A\x{41}\x{300}",   # U+0041 LATIN CAPITAL LETTER A
        "\x{55}\x{308}O",   # U+0308 COMBINING DIAERESIS
        "\x{dc}U"           # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
    );
    binmode(STDOUT, ':utf8');
    for (@t) {
        print "MATCH: $_\n" if /\p{Lu}{2}/;
    }
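To make the "more than one representation" point concrete, here is a minimal sketch using the core Unicode::Normalize module. It is only one possible way to look at the problem, not necessarily what the replies in this thread recommend. The variable names ($precomposed, $decomposed, $raw, @clusters) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# The same visible text written two ways (names are illustrative only).
my $precomposed = "\x{dc}";         # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
my $decomposed  = "\x{55}\x{308}";  # U+0055 followed by U+0308 COMBINING DIAERESIS

# Compared code point by code point, the two strings differ.
printf "raw: lengths %d and %d, %s\n",
    length($precomposed), length($decomposed),
    ($precomposed eq $decomposed ? "equal" : "not equal");

# NFC composes where a precomposed form exists; NFD decomposes.
# After either normalization the two spellings compare equal.
print "equal after NFC\n" if NFC($precomposed) eq NFC($decomposed);
print "equal after NFD\n" if NFD($precomposed) eq NFD($decomposed);

# First string from the test list above: A, combining grave, B.
my $raw = "A\x{300}B";

# The combining mark is not an uppercase letter, so it sits between
# the two \p{Lu} characters in the raw string.
print "raw string matches\n" if $raw =~ /\p{Lu}{2}/;

# NFC turns A + U+0300 into the single code point U+00C0, so the
# normalized string has two adjacent uppercase letters.
print "NFC string matches\n" if NFC($raw) =~ /\p{Lu}{2}/;

# \X matches a base character together with its combining marks,
# so it counts user-visible characters rather than code points.
my @clusters = $raw =~ /(\X)/g;
printf "code points: %d, clusters: %d\n", length($raw), scalar @clusters;
```

Normalizing to NFC before matching only helps where a precomposed code point exists; for sequences with no composed form, a pattern built around \X or \p{M}* may be a better fit.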
Re: unicode combined characters in regular expressions
by graff (Chancellor) on Feb 01, 2007 at 18:37 UTC

Re: unicode combined characters in regular expressions
by almut (Canon) on Feb 01, 2007 at 15:09 UTC

Re: unicode combined characters in regular expressions
by Anonymous Monk on Feb 01, 2007 at 17:52 UTC