My question: is it possible to write regular expressions so that combined characters are treated as "single" characters by the regular expression engine?
The short answer is "no". In regular expressions, the term "character" refers to a given codepoint, whether it be a plain letter, a plain accent mark, a combining accent mark, or whatever. Any human-language-based interpretation of a codepoint sequence as one "linguistic" character has no direct status or support in regex syntax.
But as the other replies have pointed out, there are things you can do to accommodate codepoint sequences that make up "single letters" in the human-language sense. You can normalize the character data before applying regexes (i.e. replace each base-plus-combining-mark sequence with a single precomposed codepoint where one exists, which is what the NFC function in Unicode::Normalize can do for you), and/or you can include expressions for "combining characters" in your regex where necessary, using things like "\p{Mn}" (see perlunicode).
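Here is a minimal sketch of both approaches, assuming a made-up test string of "e" followed by U+0301 (COMBINING ACUTE ACCENT); the variable names are just for illustration, not anything from the original thread:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical sample: "e" + U+0301 COMBINING ACUTE ACCENT.
# Two codepoints, but one "letter" to a human reader.
my $decomposed = "e\x{301}";

# As-is, "." consumes only the base letter and leaves the accent behind.
my ($dot_match) = $decomposed =~ /^(.)/;

# Approach 1: normalize first. NFC composes the pair into U+00E9,
# so the regex engine now sees a single codepoint.
my $composed = NFC($decomposed);

# Approach 2: leave the data alone and allow any trailing
# nonspacing marks (\p{Mn}) after the base character in the pattern.
my ($with_marks) = $decomposed =~ /^(.\p{Mn}*)/;

printf "dot match length:      %d\n", length $dot_match;
printf "composed length:       %d (U+%04X)\n", length $composed, ord $composed;
printf "base + marks length:   %d\n", length $with_marks;
```

Keep in mind that NFC only helps where a precomposed codepoint actually exists, so the \p{Mn} pattern is the more general of the two. Beyond what the replies above mention, perlrebackslash also documents \X, which matches a base character together with its combining marks as one unit and may be worth a look.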
In reply to Re: unicode combined characters in regular expressions
by graff
in thread unicode combined characters in regular expressions
by telcontar