
When you fetch utf8 text from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami -- so that perl has a valid utf8 string with the "utf8" flag turned on. Then, you can do lots of useful things using normal perl string operations.

For example, here's a neat and easy way to eliminate all diacritic marks that come attached to Latin alphabetic letters:

use Encode qw/decode is_utf8/;
use Unicode::Normalize;

# let $string be a value that was just fetched from a utf8 database field,
#  in which case, you will most likely need to do this:

$string = decode( "utf8", $string );

# or just for testing, comment out the previous line, and
# $string = join( "", map{chr()} 0xc0..0xff ); # uncomment this line

# NFD normalization splits off all diacritic marks as separate code points
# and these "combining" marks for latin are in the U+0300-U+036F range

( $string_nd = NFD( $string )) =~ tr/\x{300}-\x{36f}//d;

binmode STDOUT, ":utf8";   # just to be sure this has been done

print "original: << $string >>\n";
print "  edited: << $string_nd >>\n";

Alas, that form of normalization does not convert "ø" to "o", or "Æ" to "AE", or "ß" to "ss", etc. That is, there may still be non-ascii characters in the final result, depending on what you have in your database, and for stuff like that, you'll just have to face the task of defining what sort of behavior you really want (e.g. just strip them out, or define an explicit list of replacements, or...)
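If you go the "explicit list of replacements" route, it could look something like the sketch below. The %replace table and the to_ascii_ish() name are made up for illustration; the table only covers the few letters mentioned above, and anything not listed is left alone rather than stripped.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical replacement table -- extend it to match your own data.
my %replace = (
    "\x{F8}" => "o",  "\x{D8}" => "O",    # o with stroke (lower, upper)
    "\x{E6}" => "ae", "\x{C6}" => "AE",   # ae ligature (lower, upper)
    "\x{DF}" => "ss",                     # sharp s
);

sub to_ascii_ish {
    my ($str) = @_;

    # Same step as above: decompose, then drop the combining marks.
    ( my $out = NFD($str) ) =~ tr/\x{300}-\x{36f}//d;

    # Look at each remaining non-ASCII character; swap it if it is in
    # the table, otherwise keep it unchanged.
    $out =~ s/([^\x00-\x7F])/exists $replace{$1} ? $replace{$1} : $1/ge;
    return $out;
}
```

Calling to_ascii_ish() on an already-decoded string should handle both the accented letters and the ones in the table in one pass.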

In case it might help, it's easy to get an inventory of the characters you have in the database, so that you can see which ones, if any, need special attention beyond just stripping diacritic marks. I posted a little tool here that shows one way to do that: unichist -- count/summarize characters in data.
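To give the idea (this is not the unichist tool itself, just a minimal sketch with a made-up char_inventory() function), counting the non-ASCII characters in a set of already-decoded strings takes only a few lines:

```perl
use strict;
use warnings;

# Count how often each non-ASCII character occurs in the given
# (already decoded) strings; returns a hash ref of char => count.
sub char_inventory {
    my %count;
    for my $str (@_) {
        $count{$1}++ while $str =~ /([^\x00-\x7F])/g;
    }
    return \%count;
}

# Sample data standing in for values fetched from the database.
my $inv = char_inventory( "na\x{EF}ve caf\x{E9}", "caf\x{E9} \x{F8}l" );

# Report by code point, most frequent first.
for my $ch ( sort { $inv->{$b} <=> $inv->{$a} || $a cmp $b } keys %$inv ) {
    printf "U+%04X  %d\n", ord($ch), $inv->{$ch};
}
```

Any code point that shows up in the report and has no canonical decomposition is a candidate for an explicit replacement rule.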

One other caveat about that normalization process: for a number of languages (e.g. those that use Arabic, Hebrew, Devanagari, or other non-Latin scripts with diacritic marks), you may want/need to apply "NFC" normalization (also provided by Unicode::Normalize) after doing "NFD" and Latin diacritic removal, so that you "recompose" the non-Latin characters and diacritics into their "canonical" combined-character forms.
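A minimal sketch of that NFD / strip / NFC sequence, assuming the string has already been decoded (strip_latin_marks() is just a name picked for the example):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

sub strip_latin_marks {
    my ($str) = @_;

    # Decompose, then delete only the U+0300-U+036F combining marks.
    ( my $out = NFD($str) ) =~ tr/\x{300}-\x{36f}//d;

    # Recompose whatever is left, so marks outside that range are
    # rejoined with their base characters where a composed form exists.
    return NFC($out);
}

# U+0622 (Arabic alef with madda above) decomposes to U+0627 U+0653;
# U+0653 is outside the deleted range, so NFC can put it back together.
my $result = strip_latin_marks("caf\x{E9} \x{622}");
```

The Latin "é" comes back as plain "e", while the Arabic letter ends up in its composed form again instead of staying split into two code points.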

(update; having just seen ikegami's point about the "utf8::" functions, I agree -- that's a fine alternative to "use Encode".)
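For reference, a small sketch of the "utf8::" route; the byte string here is only a stand-in for a value fetched from the database. Note that utf8::decode works in place and returns false if the bytes are not well-formed UTF-8:

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "cafe" with an acute accent on the e,
# standing in for a freshly fetched database value.
my $string = "caf\xC3\xA9";

# Decode in place; no "use Encode" (or "use utf8") is needed for this call.
utf8::decode($string)
    or warn "value was not valid UTF-8\n";

# $string now holds 4 characters instead of 5 bytes.
my $len = length $string;
```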

In reply to Re: utf8 characters in tr/// or s/// by graff
in thread utf8 characters in tr/// or s/// by MattLG
