
How do I normalize (e.g. strip) diacritical märks from a Unicode string?

by Anonymous Monk
on Apr 17, 2010 at 07:24 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question: (strings)

How do I normalize a Unicode string, folding or removing all diacritical marks and accents? I wish to do this in preparation for saving words for later searches.

Replies are listed 'Best First'.
Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by brycen (Monk) on Apr 17, 2010 at 07:36 UTC

    Unicode defines a variety of normalization forms (see http://unicode.org/reports/tr15/).

    I prefer normalization form NFKD, as it translates more ligatures (though not all, for example the ligature Œ).

    First decompose composite characters into their component parts (e.g. letters and diacritical marks), then strip out the marks.

        $str = Unicode::Normalize::NFKD($str);
        $str =~ s/\p{NonspacingMark}//g;

    Or with a full example:

        ## Demonstrate stripping of diacritical marks from Unicode strings
        ## April 2010, Bryce Nesbitt, Berkeley Electronic Press
        ## See also http://unicodelookup.com/
        ## See also http://en.wikipedia.org/wiki/Diacritic
        ## Keywords: perl, diacritic, diacritical
        ##           accent, iso-8859-1, normalization.
        use strict;
        use warnings;
        use utf8;       # Tell perl source code is utf-8
        use 5.10.0;
        use Unicode::Normalize;

        # Sample: "latin small letter e with circumflex and tilde" ễ
        #         "latin small ligature ff" (will be expanded)
        #         "latin small ligature oe" (won't be expanded)
        my $str = shift || "\x{1ec5} märks \x{fb00} \x{153}";
        say "Input: ".debug_chatty_string($str);

        # Decompose into letter and combining marks, in "Kompatibility" mode
        $str = NFKD($str);
        say "NFKD : ".debug_chatty_string($str);

        # Remove combining marks
        $str =~ s/\p{NonspacingMark}//g;
        $str = lc($str);
        say "Out : ".debug_chatty_string($str);

        # Return the string with every non-printable or non-ASCII
        # character shown as <0x...> so the code points are visible.
        sub debug_chatty_string {
            my $outstring;
            # Use shift below, so utf-8 flag is preserved.
            # Else you might have to fiddle with Encode::_utf8_on()
            foreach my $char (split //, shift) {
                my $ord = ord($char);
                if (($ord >= 32 && $ord < 127) || $ord == 10) {
                    $outstring .= $char;
                } else {
                    $outstring .= "<0x".sprintf("%x", $ord).">";
                }
            }
            return $outstring;
        }

    Example run:

        Input: <0x1ec5> m<0xe4>rks <0xfb00> <0x153>
        NFKD : e<0x302><0x303> ma<0x308>rks ff <0x153>
        Out : e marks ff <0x153>

    Update: I really do mean normalization. ASCIIfying (e.g. encoding) would destroy non-latin text. Normalization preserves Greek, Hebrew, etc.

    I am supporting clients in various languages who want the fuzzy matching that stripping diacriticals provides. It might make for the occasional confusion between German bears and bars... but that's much better than missing out on all the potential correct matches. For example in Hebrew vowels are not normally written except for children. Stripping the vowel and pronunciation diacriticals out lets you compare the text as an adult searcher will likely enter it.
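
The search use case described above can be sketched as follows. This is a minimal sketch, assuming the same folding is applied both when words are stored and when a query is entered, and that the input is already a decoded character string. The helper name `search_key` is hypothetical and not part of Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: fold a decoded (character) string into a search key.
sub search_key {
    my ($text) = @_;
    my $key = NFKD($text);    # split letters from their combining marks
    $key =~ s/\p{Mn}//g;      # drop nonspacing marks (accents, Hebrew points)
    return lc $key;           # fold case, as the full example does
}

# German: "Bär" and "Bar" collapse to the same key (the bear/bar confusion).
print "same key\n" if search_key("B\x{e4}r") eq search_key("Bar");

# Hebrew: a pointed spelling folds to the unpointed one, letters intact.
my $pointed   = "\x{5e9}\x{5b8}\x{5c1}\x{5dc}\x{5d5}\x{5b9}\x{5dd}";
my $unpointed = "\x{5e9}\x{5dc}\x{5d5}\x{5dd}";
print "hebrew matches\n" if search_key($pointed) eq $unpointed;
```

Storing the folded key next to the original word keeps the original spelling available for display, while comparisons run on the key only.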

Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by moritz (Cardinal) on Apr 17, 2010 at 07:36 UTC
    The trick is to split the letters with diacritical marks into the base letter and the mark, which Unicode::Normalize does with the NFD function. Then the regex /\pM/ identifies marking characters (see perlunicode).

        use strict;
        use warnings;
        use utf8;
        use Unicode::Normalize;

        my $s = "söme stüff\n";
        $s = NFD($s);
        $s =~ s/\pM//g;
        print $s;

    Depending on the application, NFKD might or might not be more appropriate than NFD.
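
To make the difference between the two forms concrete, here is a small sketch that prints the code points each form produces for a ligature followed by an accented letter. The helper `codepoints` is a hypothetical name introduced only for display.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# Render every character as U+XXXX so the result is unambiguous on any terminal.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# U+FB00 is the "ff" ligature; U+00F6 is o with diaeresis.
my $sample = "\x{fb00}\x{f6}";

print "NFD : ", codepoints(NFD($sample)),  "\n";   # ligature kept, o split from U+0308
print "NFKD: ", codepoints(NFKD($sample)), "\n";   # ligature expanded to two f's
```

NFD applies only canonical decomposition, so the ligature survives; NFKD also applies compatibility decomposition, which is what expands it.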

    The code snippet above removes all combining marks, not just one kind of diacritic. You can change that by removing only specific marks, for example \x{308} (the combining diaeresis). The following code strips the diaeresis but leaves the accents:

        use strict;
        use warnings;
        use utf8;
        use Unicode::Normalize;
        binmode STDOUT, ':utf8';

        my $s = "söme stüff with áccènts\n";
        $s = NFD($s);
        $s =~ s/\x{308}//g;
        $s = NFC($s);
        print $s;

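
The same approach extends to more than one mark. The following is a sketch only, assuming the caller passes the code points of the marks to remove; the helper name `strip_marks` is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: remove only the listed combining marks, keep the rest.
sub strip_marks {
    my ($text, @marks) = @_;
    my %drop = map { chr($_) => 1 } @marks;    # set of marks to delete
    my $decomposed = NFD($text);               # separate letters and marks
    $decomposed =~ s/(\pM)/$drop{$1} ? '' : $1/ge;
    return NFC($decomposed);                   # recompose what is left
}

# Drop the diaeresis (U+0308) and the acute (U+0301), keep the grave (U+0300).
my $result = strip_marks("s\x{f6}me \x{e1}cc\x{e8}nts", 0x308, 0x301);

binmode STDOUT, ':utf8';
print "$result\n";
```

Recomposing with NFC at the end means the marks that were kept end up as precomposed letters again, as in the snippet above.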
Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by ikegami (Patriarch) on Apr 17, 2010 at 17:26 UTC
    If by "normalize" you mean ASCIIfying text, this can be done using Text::Unidecode.
Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by afoken (Chancellor) on Apr 17, 2010 at 18:47 UTC

    I don't think that this kind of "normalizing" is a good idea. In German, "Bär" and "Bar" are two very different things, and stripping a Bär doesn't make entering it more attractive.

    Alexander

      This (like stemming) is often done for search engines.
Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by brycen (Monk) on Apr 20, 2010 at 18:22 UTC

    afoken: no, this is normalization. ASCIIfying (e.g. encoding) would destroy non-latin text. This method preserves Greek, Hebrew, etc.

    Alexander: I am supporting clients in various languages who want the fuzzy matching that stripping diacriticals provides. It might make for the occasional confusion between German bears and bars... but that's much better than missing out on all the potential correct matches. For example in Hebrew vowels are not normally written except for children. Stripping the vowel and pronunciation diacriticals out lets you compare the text as an adult searcher will likely enter it.

    Note that I prefer normalization form NFKD, as it translates more ligatures (though not all, for example the ligature Œ).

    Originally posted as a Categorized Answer.

Re: How do I normalize (e.g. strip) diacritical märks from a Unicode string?
by Anonymous Monk on Apr 17, 2010 at 10:53 UTC
    Have you checked whether the Unicode::Normalize module could solve your problem? See http://search.cpan.org/~sadahiro/Unicode-Normalize-1.06/Normalize.pm

    Originally posted as a Categorized Answer.
