unicode combined characters in regular expressions

telcontar has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

perlretut states:

The answer to requirement 2), as of 5.6.0, is that if a regexp contains Unicode characters, the string is searched as a sequence of Unicode characters. Otherwise, the string is searched as a sequence of bytes.

However, in Unicode, some characters may be represented in more than one way, e.g. as a single code point, or as a combined character (a base character followed by a combining mark).
I assumed (probably wrongly) that in a regular expression, perl would know a combined character to be "one" character. But it doesn't look like it:

#!/usr/bin/perl

my @t = ("A\x{300}B",       # U+0300 COMBINING GRAVE ACCENT
         "AA",
         "A\x{41}\x{300}",  # U+0041 LATIN CAPITAL LETTER A
         "\x{55}\x{308}O",  # U+0308 COMBINING DIAERESIS
         "\x{dc}U"          # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
);

binmode(STDOUT, ':utf8');

for (@t) {
  print "MATCH: $_\n" if /\p{Lu}{2}/;
}

Uppercase characters represented by a single code point match. Combined characters match only if the combining mark isn't between the two uppercase letters, because \p{Lu}{2} needs two adjacent uppercase code points. This is why "A\x{300}B" doesn't match but "A\x{41}\x{300}" does.
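To make the behaviour easier to see, here is a small sketch that lists the code points the regex engine actually walks over. The helper name `codepoints` is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: show a string as a list of U+XXXX code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# The combining mark sits between the two uppercase letters here ...
print codepoints("A\x{300}B"), "\n";

# ... while here the two uppercase letters are adjacent and the mark trails.
print codepoints("A\x{41}\x{300}"), "\n";
```

In the first string the uppercase letters are separated by U+0300, so there is no place where two \p{Lu} code points are next to each other.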

My question: is it possible to write regular expressions so that combined characters are treated as "single" characters by the regular expression engine?

Any help would be very appreciated.


Re: unicode combined characters in regular expressions by graff (Chancellor) on Feb 01, 2007 at 18:37 UTC
> My question: is it possible to write regular expressions so that combined characters are treated as "single" characters by the regular expression engine?

The short answer is "no". In regular expressions, the term "character" refers to a given codepoint, whether it be a plain letter, a plain accent mark, a combining accent mark, or whatever. Any human-language-based interpretation of a codepoint sequence as one "linguistic" character has no direct status or support in regex syntax.

But as the other replies have pointed out, there are things you can do to accommodate codepoint sequences that make up "single letters" in the human-language sense: normalize the character data before applying regexes (i.e. replace codepoint sequences with single-character codepoints where possible, which is what Unicode::Normalize can do for you), and/or include expressions for "combining characters" in your regex, where necessary, using things like "\p{Mn}" (see perlunicode).
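A minimal sketch of the "normalize first" approach, assuming Unicode::Normalize is installed and that composing with NFC is what you want. The sub name `has_two_uppercase` is made up for illustration; the test strings are the ones from the question.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compose the string, then apply the original pattern.
sub has_two_uppercase {
    my ($str) = @_;
    # NFC replaces base + combining mark with a single precomposed
    # code point wherever Unicode defines one (e.g. A + U+0300 => U+00C0).
    my $composed = NFC($str);
    return $composed =~ /\p{Lu}{2}/ ? 1 : 0;
}

binmode(STDOUT, ':utf8');

for my $s ("A\x{300}B", "AA", "A\x{41}\x{300}", "\x{55}\x{308}O", "\x{dc}U") {
    print "MATCH: $s\n" if has_two_uppercase($s);
}
```

Note that this only helps where a precomposed code point exists. A sequence such as "Q\x{300}" has no single-code-point form, so it stays decomposed after NFC and the mark still gets in the way of \p{Lu}{2}.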
Re: unicode combined characters in regular expressions by almut (Canon) on Feb 01, 2007 at 15:09 UTC
I think you have to explicitly allow the mark, e.g.

  /(\p{Lu}\p{M}*){2}/
or
  /(\p{Lu}\p{Mn}*){2}/   # "non-spacing" mark

Update: Also, there's `\X`, which matches a general "combining character sequence", though I wouldn't know how to specify the desired "letter"/"uppercase" property in that case... but you probably knew this already.
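One possible way to combine `\X` with the uppercase property, offered as a sketch rather than a recipe: put a lookahead for \p{Lu} in front of each `\X`, so that every combining character sequence has to start with an uppercase letter. The names `$two_upper` and `has_two_uppercase_clusters` are made up for this example.

```perl
use strict;
use warnings;

# Two consecutive combining character sequences, each of which must
# begin with an uppercase letter. The lookahead checks the base
# character; \X then consumes the base plus any marks attached to it.
my $two_upper = qr/(?:(?=\p{Lu})\X){2}/;

# Hypothetical helper wrapping the pattern.
sub has_two_uppercase_clusters {
    my ($str) = @_;
    return $str =~ $two_upper ? 1 : 0;
}

binmode(STDOUT, ':utf8');

for my $s ("A\x{300}B", "AA", "A\x{41}\x{300}", "\x{55}\x{308}O", "\x{dc}U") {
    print "MATCH: $s\n" if has_two_uppercase_clusters($s);
}
```

This treats precomposed and decomposed forms alike without changing the data, at the cost of a slightly less readable pattern.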
Re: unicode combined characters in regular expressions by Anonymous Monk on Feb 01, 2007 at 17:52 UTC
I don't use Unicode much, but what I've read about Unicode::Normalize indicates it may help take some of the variability out of your Unicode strings.
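As a rough illustration of removing that variability, one option (an assumption on my part, not something from the module's docs) is to decompose everything with NFD, so that every accented letter becomes base + mark(s), and then use a single pattern that allows marks after each letter. The sub name `has_two_uppercase_nfd` is made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose first, then allow any number of
# combining marks after each uppercase base letter.
sub has_two_uppercase_nfd {
    my ($str) = @_;
    my $decomposed = NFD($str);   # e.g. U+00DC => U + U+0308
    return $decomposed =~ /(?:\p{Lu}\p{M}*){2}/ ? 1 : 0;
}

binmode(STDOUT, ':utf8');

for my $s ("A\x{300}B", "\x{dc}U", "Q\x{300}Z") {
    print "MATCH: $s\n" if has_two_uppercase_nfd($s);
}
```

Because everything is in decomposed form, this also covers sequences that have no precomposed code point.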