in reply to utf8 characters in tr/// or s///
For example, here's a neat and easy way to eliminate all diacritic marks that come attached to ascii Latin alphabetic letters:

    use Encode qw/decode is_utf8/;
    use Unicode::Normalize;

    # let $string be value that was just fetched from a utf8 database field,
    # in which case, you will most likely need to do this:
    $string = decode( "utf8", $string );

    # or just for testing, comment out the previous line, and
    # $string = join( "", map{chr()} 0xc0..0xff );  # uncomment this line

    # NFD normalization splits off all diacritic marks as separate code points
    # and these "combining" marks for latin are in the U0300-U036F range
    ( $string_nd = NFD( $string )) =~ tr/\x{300}-\x{36f}//d;

    binmode STDOUT, ":utf8";  # just to be sure this has been done
    print "original: << $string >>\n";
    print "  edited: << $string_nd >>\n";

Alas, that form of normalization does not convert "ø" to "o", or "Æ" to "AE", or "ß" to "ss", etc. That is, there may still be non-ascii characters in the final result, depending on what you have in your database, and for stuff like that, you'll just have to face the task of defining what sort of behavior you really want (e.g. just strip them out, or define an explicit list of replacements, or...)
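If you go the "explicit list of replacements" route, it might look something like the sketch below. The %replace table and the to_ascii_letters name are made up for illustration, and the table only covers the handful of characters mentioned above; you would extend it to match whatever actually turns up in your data.

```perl
use strict;
use warnings;
use Unicode::Normalize;

# Hypothetical replacement table for letters that NFD does not decompose.
# Extend it to cover whatever your own data contains.
my %replace = (
    "\x{f8}" => "o",     # LATIN SMALL LETTER O WITH STROKE
    "\x{d8}" => "O",     # LATIN CAPITAL LETTER O WITH STROKE
    "\x{e6}" => "ae",    # LATIN SMALL LETTER AE
    "\x{c6}" => "AE",    # LATIN CAPITAL LETTER AE
    "\x{df}" => "ss",    # LATIN SMALL LETTER SHARP S
);

# Build a character class such as [\x{f8}\x{d8}...] from the table keys.
my $class = join "", map { sprintf "\\x{%x}", ord($_) } keys %replace;
my $regex = qr/([$class])/;

sub to_ascii_letters {
    my ($string) = @_;    # expects an already-decoded character string
    # Step 1: split off combining marks and delete the Latin ones.
    ( my $out = NFD($string) ) =~ tr/\x{300}-\x{36f}//d;
    # Step 2: apply the explicit replacements.
    $out =~ s/$regex/$replace{$1}/g;
    return $out;
}
```

Calling to_ascii_letters() on a decoded string does the diacritic stripping first and the table lookup second; anything not in the table passes through unchanged, so you can still spot it later.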
In case it might help, it's easy to get an inventory of the characters you have in the database, so that you can see which ones, if any, need special attention beyond just stripping diacritic marks. I posted a little tool here that shows one way to do that: unichist -- count/summarize characters in data.
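This is not the unichist tool itself, just a bare-bones sketch of the same idea, assuming you already have decoded character strings in hand. The function names char_inventory and print_inventory are invented for the example.

```perl
use strict;
use warnings;

# Count how often each non-ASCII character occurs in a decoded string.
# Returns a hash reference: character => number of occurrences.
sub char_inventory {
    my ($string) = @_;
    my %count;
    $count{$_}++ for grep { ord($_) > 0x7f } split //, $string;
    return \%count;
}

# Print the inventory as "U+XXXX  count", one line per character,
# sorted by code point.
sub print_inventory {
    my ($count) = @_;
    for my $char ( sort { ord($a) <=> ord($b) } keys %$count ) {
        printf "U+%04X  %d\n", ord($char), $count->{$char};
    }
}
```

Run char_inventory() over each field you fetch (adding the counts together if you want totals), and whatever is still listed after diacritic removal is what needs a decision from you.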
One other caveat about that normalization process: for a number of languages (e.g. those that use Arabic, Hebrew, Devanagari, or other non-Latin scripts with diacritic marks), you may want/need to apply "NFC" normalization (also provided by Unicode::Normalize) after doing "NFD" and Latin diacritic removal, so that you "recompose" the non-Latin characters and diacritics into their "canonical" combined-character forms.
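A minimal sketch of that NFD, strip, NFC sequence follows; the strip_latin_marks name is just something picked for the example. The Arabic letter U+0622 makes a handy test case, because NFD splits it into U+0627 plus the combining mark U+0653, which lies outside the U+0300-U+036F range and so survives the tr///.

```perl
use strict;
use warnings;
use Unicode::Normalize;

sub strip_latin_marks {
    my ($string) = @_;    # expects an already-decoded character string
    # Decompose, then delete only the combining marks in U+0300-U+036F.
    ( my $out = NFD($string) ) =~ tr/\x{300}-\x{36f}//d;
    # Recompose whatever is left, so that non-Latin letters whose marks
    # were not deleted go back to their canonical composed forms.
    return NFC($out);
}
```

Without the final NFC() call, the result would contain the two-code-point sequence U+0627 U+0653 where the input had the single character U+0622.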
(update; having just seen ikegami's point about the "utf8::" functions, I agree -- that's a fine alternative to "use Encode".)