comment on

When you fetch utf8 texrt from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami -- so that perl has a valid utf8 string with the "utf8" flag turned on. Then, you can do lots of useful things using normal perl string operations.

For example, here's a neat and easy way to eliminate all diacritic marks that come attached to ~~ascii~~ Latin alphabetic letters:

use Encode qw/decode is_utf8/;
use Unicode::Normalize;

# let $string be value that was just fetched from a utf8 database fiel
+d,
#  in which case, you will most likely need to do this:

$string = decode( "utf8", $string );

# or just for testing, comment out the previous line, and
# $string = join( "", map{chr()} 0xc0..0xff ); # uncomment this line

# NFD normalization splits off all diacritic marks as separate code po
+ints
# and these "combining" marks for latin are in the U0300-U036F range

( $string_nd = NFD( $string )) =~ tr/\x{300}-\x{36f}//d;

binmode STDOUT, ":utf8";   # just to be sure this has been done

print "original: << $string >>\n";
print "  edited: << $string_nd >>\n";
[download]

Alas, that form of normalization does not convert "ø" to "o", or "Æ" to "AE", or "ß" to "ss", etc. That is, there may still be non-ascii characters in the final result, depending on what you have in your database, and for stuff like that, you'll just have to face the task of defining what sort of behavior you really want (e.g. just strip them out, or define an explicit list of replacements, or...)

In case it might help, it's easy to get an inventory of the characters you have in the database, so that you can see which ones, if any, need special attention beyond just stripping diacritic marks. I posted a little tool here that shows one way to do that: unichist -- count/summarize characters in data.

One other caveat about that normalization process: for a number of languages (e.g. those that use Arabic, Hebrew, Devanagari, or other non-Latin scripts with diacritic marks), you may want/need to apply "NFC" normalization (also provided by Unicode::Normalize) after doing "NFD" and Latin diacritic removal, so that you "recompose" the non-Latin characters and diacritics into their "canonical" combined-character forms.

(update; having just seen ikegami's point about the "utf8::" functions, I agree -- that's a fine alternative to "use Encode".)

In reply to Re: utf8 characters in tr/// or s/// by graff
in thread utf8 characters in tr/// or s/// by MattLG

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.