My question: is it possible to write regular expressions so that combined characters are treated as "single" characters by the regular expression engine?

The short answer is "no". In regular expressions, the term "character" refers to a given codepoint, whether it be a plain letter, a plain accent mark, a combining accent mark, or whatever. Any human-language-based interpretation of a codepoint sequence as one "linguistic" character has no direct status or support in regex syntax.

But as the other replies have pointed out, there are two things you can do to accommodate codepoint sequences that make up "single letters" in the human-language sense. You can normalize the character data before applying regexes (i.e. replace each base-plus-combining-mark sequence with a single precomposed codepoint where one exists, which is what Unicode::Normalize can do for you), and/or you can include expressions for "combining characters" in your regex where necessary, using things like "\p{Mn}" (see perlunicode).
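
To make the two approaches concrete, here is a minimal sketch. The sample strings and the variable names ($decomposed, $letter, $word and so on) are my own illustration and not from the original question, and the pattern assumes that only nonspacing marks (\p{Mn}) follow the base character.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two codepoints that display as one letter.
my $decomposed = "e\x{301}";

# Approach 1: normalize first. NFC composes the pair into the
# single codepoint U+00E9, because a precomposed form exists.
my $composed = NFC($decomposed);

# "." matches exactly one codepoint, so it only spans the whole
# letter after normalization.
my $dot_raw = $decomposed =~ /^.$/ ? 1 : 0;
my $dot_nfc = $composed   =~ /^.$/ ? 1 : 0;

# Approach 2: spell out "one non-mark codepoint plus any
# nonspacing marks" in the regex itself (illustrative pattern).
my $letter  = qr/\P{M}\p{Mn}*/;
my $word    = "cafe\x{301}s";            # c a f e+accent s
my @letters = $word =~ /($letter)/g;

printf "codepoints: %d raw, %d after NFC\n",
    length($decomposed), length($composed);
printf "dot matches whole letter: raw=%d, NFC=%d\n", $dot_raw, $dot_nfc;
printf "letters in word: %d (codepoints: %d)\n",
    scalar(@letters), length($word);
```

Normalization only helps where a precomposed codepoint exists, so the second approach is the fallback for sequences that NFC cannot compose. It may also be worth reading about \X in perlrebackslash, which is documented as matching an extended grapheme cluster; whether it suits your case depends on your Perl version and data.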


In reply to Re: unicode combined characters in regular expressions by graff
in thread unicode combined characters in regular expressions by telcontar
