Unicode defines a variety of normalization forms (see http://unicode.org/reports/tr15/).

I prefer normalization form NFKD because its compatibility decomposition also expands many ligatures (though not all: the ligature Œ, for example, has no decomposition and is left as it is).
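
To make the difference between the canonical and compatibility forms concrete, here is a minimal sketch. The helper show() is made up for this illustration; only NFD and NFKD come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# Illustrative helper (not part of any module): print printable ASCII
# as-is and every other code point as <0x...>.
sub show {
    my ($s) = @_;
    return join '', map {
        my $o = ord($_);
        ($o >= 32 && $o < 127) ? $_ : sprintf('<0x%x>', $o);
    } split //, $s;
}

my $ff = "\x{fb00}";    # latin small ligature ff
my $oe = "\x{153}";     # latin small ligature oe

print "NFD  ff: ", show(NFD($ff)),  "\n";   # <0xfb00>  (canonical form keeps it)
print "NFKD ff: ", show(NFKD($ff)), "\n";   # ff        (compatibility form expands it)
print "NFKD oe: ", show(NFKD($oe)), "\n";   # <0x153>   (no decomposition exists)
```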

First decompose composite characters into their component parts (e.g. letters and diacritical marks), then strip out the marks.

use Unicode::Normalize;     # must be loaded before NFKD can be called
$str = Unicode::Normalize::NFKD($str);
$str =~ s/\p{NonspacingMark}//g;

Or with a full example:

## Demonstrate stripping of diacritical marks from Unicode strings
## April 2010, Bryce Nesbitt, Berkeley Electronic Press
## See also http://unicodelookup.com/
## See also http://en.wikipedia.org/wiki/Diacritic
## Keywords: perl, diacritic, diacritical
##           accent, iso-8859-1, normalization.
use utf8;                   # Tell perl the source code is UTF-8
use 5.10.0;
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize;

# Sample: "latin small letter e with circumflex and tilde" (U+1EC5)
#         "latin small ligature ff" (will be expanded)
#         "latin small ligature oe" (won't be expanded)
# A command-line argument arrives as raw bytes, so decode it first
# (this assumes the terminal sends UTF-8).
my $str = @ARGV ? decode('UTF-8', shift) : "\x{1ec5} märks \x{fb00} \x{153}";
say "Input: ".debug_chatty_string($str);

# Decompose into letter and combining marks, in "Kompatibility" mode
$str = NFKD($str);
say "NFKD : ".debug_chatty_string($str);

# Remove combining marks, then fold to lower case
$str =~ s/\p{NonspacingMark}//g;
$str = lc($str);
say "Out  : ".debug_chatty_string($str);

# Return the string with everything outside printable ASCII (and newline)
# shown as <0x...> code points.
sub debug_chatty_string
{
    my $outstring = '';
    # Use shift below, so utf-8 flag is preserved.
    # Else you might have to fiddle with Encode::_utf8_on()
    foreach my $char (split //,shift) {
        my $ord = ord($char);
        if(($ord >= 32 && $ord < 127) || $ord == 10) {
            $outstring .= $char;
        } else {
            $outstring .= "<0x".sprintf("%x",$ord).">";
        }
    }
    return $outstring;
}

Example run:

Input: <0x1ec5> m<0xe4>rks <0xfb00> <0x153>
NFKD : e<0x302><0x303> ma<0x308>rks ff <0x153>
Out  : e marks ff <0x153>

Update: I really do mean normalization. ASCIIfying (i.e. encoding to ASCII) would destroy non-Latin text. Normalization preserves Greek, Hebrew, etc.
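
A small sketch of that difference, using a single Greek letter. The helper strip_marks() is a name invented here and just wraps the two steps shown earlier; the ASCII side uses Encode's default behaviour of substituting "?" for unmappable characters.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFKD);

# Illustrative helper: decompose, then drop the nonspacing marks.
sub strip_marks {
    my ($s) = @_;
    $s = NFKD($s);
    $s =~ s/\p{NonspacingMark}//g;
    return $s;
}

my $greek = "\x{3ac}";    # greek small letter alpha with tonos

# Normalization keeps the base letter: plain alpha, U+03B1
printf "stripped: U+%04X\n", ord(strip_marks($greek));   # stripped: U+03B1

# Encoding to ASCII has nothing to map the letter to
print "ascii   : ", encode('ascii', $greek), "\n";       # ascii   : ?
```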

I am supporting clients in various languages who want the fuzzy matching that stripping diacriticals provides. It might make for the occasional confusion between German bears and bars... but that's much better than missing out on all the potential correct matches. For example, in Hebrew, vowels are not normally written except in texts for children. Stripping the vowel and pronunciation diacriticals out lets you compare the text as an adult searcher will likely enter it.
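
As a rough sketch of that kind of matching: build a comparison key for each string and compare the keys. The name match_key() is hypothetical, and the strings are written as \x{...} escapes so the snippet does not depend on the file's encoding.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: decompose, drop nonspacing marks, lowercase.
sub match_key {
    my ($s) = @_;
    $s = NFKD($s);
    $s =~ s/\p{NonspacingMark}//g;
    return lc $s;
}

# Hebrew "shalom" with vowel points, and as an adult would type it
my $pointed   = "\x{5e9}\x{5c1}\x{5b8}\x{5dc}\x{5d5}\x{5b9}\x{5dd}";
my $unpointed = "\x{5e9}\x{5dc}\x{5d5}\x{5dd}";
print "Hebrew forms match\n"
    if match_key($pointed) eq match_key($unpointed);

# The price: German "Bär" (bear) and "Bar" (bar) share one key
print "German words collide\n"
    if match_key("B\x{e4}r") eq match_key("Bar");
```

Both lines are printed: the first is the match you want, the second is the confusion you accept in exchange.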
