Locutus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Unicode Gurus,

Given a text file encoded in UTF-8, I have to replace each pair of a base character and a combining character (e.g., 0x61 0xCC 0x88 = LATIN SMALL LETTER A + COMBINING DIAERESIS) with the corresponding single pre-combined character (in the example: 0xC3 0xA4 = LATIN SMALL LETTER A WITH DIAERESIS), if one exists. This sounds like a problem made for Perl, but I haven't been able to find anything useful on CPAN yet. Can you point me in the right direction, please?

Best regards
Locutus

Re: Conversion of combined into pre-combined Unicode characters
by ikegami (Patriarch) on Mar 25, 2010 at 16:34 UTC

    Unicode::Normalize's NFC

    use charnames ':full';
    use Unicode::Normalize qw( NFC );

    # Print the Unicode name of each character in the string, one per line.
    sub dump_str {
        print(charnames::viacode(ord($_)), "\n") for split //, $_[0];
    }

    $_ = "\N{LATIN SMALL LETTER A}\N{COMBINING DIAERESIS}";
    dump_str($_);
    print("--\n");
    $_ = NFC($_);
    dump_str($_);

    LATIN SMALL LETTER A
    COMBINING DIAERESIS
    --
    LATIN SMALL LETTER A WITH DIAERESIS
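Since the question is about a UTF-8 file rather than a string literal, here is a minimal sketch of how the same NFC call might be applied to a stream. The helper name nfc_filter is hypothetical (it is not from the thread), and the demo runs on the bytes quoted in the question through in-memory filehandles. It assumes the input is valid UTF-8.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# Hypothetical helper (not from the thread): read decoded text from $in,
# compose it with NFC, and write the result to $out, line by line.
sub nfc_filter {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} NFC($line);
    }
    return;
}

# Demo on the bytes from the question, using in-memory filehandles.
# For a real file, pass a path to open() instead of a scalar reference.
my $input  = "\x61\xCC\x88\n";   # UTF-8 bytes: a + COMBINING DIAERESIS
my $output = '';

# The :encoding(UTF-8) layers decode on read and encode on write,
# so NFC() sees characters, not raw bytes.
open(my $in,  '<:encoding(UTF-8)', \$input)  or die "Cannot read: $!";
open(my $out, '>:encoding(UTF-8)', \$output) or die "Cannot write: $!";
nfc_filter($in, $out);
close($out) or die "Cannot close: $!";

# Show the resulting bytes in hex.
print(join(' ', map { sprintf('%02X', ord($_)) } split(//, $output)), "\n");
# prints: C3 A4 0A
```

The decoding step matters: NFC works on character strings, so calling it on undecoded bytes would leave the 0x61 0xCC 0x88 sequence unchanged. Normalizing line by line should be safe here, because a newline never composes with a following combining mark.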

      Wow, that was express help at warp speed!

      Thank you so much
      Locutus