http://qs1969.pair.com?node_id=714701


in reply to utf8 characters in tr/// or s///

When you fetch utf8 text from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami -- so that perl has a valid utf8 string with the "utf8" flag turned on. Then you can do lots of useful things using normal perl string operations.

For example, here's a neat and easy way to eliminate all diacritic marks that come attached to ascii Latin alphabetic letters:

use Encode qw/decode is_utf8/;
use Unicode::Normalize;

# let $string be a value that was just fetched from a utf8 database field,
# in which case, you will most likely need to do this:
$string = decode( "utf8", $string );

# or just for testing, comment out the previous line, and
# $string = join( "", map {chr()} 0xc0..0xff );   # uncomment this line

# NFD normalization splits off all diacritic marks as separate code points,
# and these "combining" marks for Latin are in the U+0300-U+036F range
( $string_nd = NFD( $string )) =~ tr/\x{300}-\x{36f}//d;

binmode STDOUT, ":utf8";   # just to be sure this has been done
print "original: << $string >>\n";
print "  edited: << $string_nd >>\n";
Alas, that form of normalization does not convert "ø" to "o", or "Æ" to "AE", or "ß" to "ss", etc. That is, there may still be non-ASCII characters in the final result, depending on what you have in your database, and for stuff like that you'll just have to face the task of defining what sort of behavior you really want (e.g. just strip them out, or define an explicit list of replacements, or...)
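If you go the explicit-replacement route, a minimal sketch might look like the following -- the particular mappings (and the %fallback name) are just illustrative, so adjust them to whatever your data actually contains:

my %fallback = (
    "\x{f8}" => "o",    # ø
    "\x{d8}" => "O",    # Ø
    "\x{e6}" => "ae",   # æ
    "\x{c6}" => "AE",   # Æ
    "\x{df}" => "ss",   # ß
);
# replace each listed character with its ascii fallback
$string_nd =~ s/([\x{c6}\x{d8}\x{df}\x{e6}\x{f8}])/$fallback{$1}/g;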

In case it might help, it's easy to get an inventory of the characters you have in the database, so that you can see which ones, if any, need special attention beyond just stripping diacritic marks. I posted a little tool here that shows one way to do that: unichist -- count/summarize characters in data.
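(For a quick-and-dirty inventory -- not the posted unichist tool, just a rough sketch of the same idea -- you can count the code points in an already-decoded string like this:)

my %seen;
$seen{ sprintf "U+%04X", ord $_ }++ for split //, $string;   # tally each code point
printf "%-8s %d\n", $_, $seen{$_} for sort keys %seen;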

One other caveat about that normalization process: for a number of languages (e.g. those that use Arabic, Hebrew, Devanagari, or other non-Latin scripts with diacritic marks), you may want/need to apply "NFC" normalization (also provided by Unicode::Normalize) after doing "NFD" and Latin diacritic removal, so that you "recompose" the non-Latin characters and diacritics into their "canonical" combined-character forms.
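In other words, for those scripts the pipeline would look something like this sketch (variable names are just for illustration):

use Unicode::Normalize qw/NFD NFC/;
( my $stripped = NFD( $string )) =~ tr/\x{300}-\x{36f}//d;   # decompose, drop Latin combining marks
my $recomposed = NFC( $stripped );                           # recompose everything else canonically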

(update: having just seen ikegami's point about the "utf8::" functions, I agree -- that's a fine alternative to "use Encode".)

Replies are listed 'Best First'.
Re^2: utf8 characters in tr/// or s///
by b10m (Vicar) on Oct 05, 2008 at 20:03 UTC
    When you fetch utf8 text from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami

    When using a fairly recent version of DBD::mysql, you can use the mysql_enable_utf8 option. Or, to quote:

    This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off.

    When set, data retrieved from a textual column type (char, varchar, etc.) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string. You will also need to ensure that your database / table / column is configured to use UTF8. See Chapter 10 of the mysql manual for details.

    Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.

    This option is experimental and may change in future versions.

    And yes, this is experimental, but it seemed to work fairly stably in my tests.
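    For what it's worth, a minimal connect sketch would look something like this (the DSN, credentials, and RaiseError setting are just placeholders):

    use DBI;
    my $dbh = DBI->connect(
        "DBI:mysql:database=mydb;host=localhost",      # placeholder DSN
        $user, $password,
        { mysql_enable_utf8 => 1, RaiseError => 1 },
    );
    # if the flag is only turned on after connecting, issue:
    # $dbh->do("SET NAMES utf8");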

    --
    b10m
      This is good to know -- thanks++!

      Based on the description, it sounds like it may be a while before this sort of facility becomes "normal", to the extent that folks would find transitioning to it to be easier than staying with the older approach.

      The situation reminds me of a Larry Wall quote (on the perl-unicode mailing list, wouldn't you know) -- this was four years ago, but it still resonates:

      Perl's always been about providing reasonable defaults, and will continue to do so. But changing what's reasonable is tricky, and sometimes you have to go through a period in which nothing can be considered reasonable.
Re^2: utf8 characters in tr/// or s///
by MattLG (Sexton) on Oct 01, 2008 at 20:35 UTC

    Brilliant! You guys RULE!

    Thanks.

    MattLG

      And one other thing I'm finding conflicting advice about on the internet is how to pack the incoming data from CGI into utf8.

      I currently use:

      $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

      against the strings that come in via the web.

      Now I see that there's a "U" template for unicode. But I'm after UTF8, so that doesn't quite fit, and I don't understand what the pack docs are saying about UTF-8. However, in a couple of places I've searched I've found this:

      $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
      utf8::decode($value);

      which I don't really understand. I'd assumed the "C" would put everything into ASCII/ISO-8859-1 and utf8::decoding that would just produce garbage out of the special characters.

      What would the monks advise?

      Cheers

      MattLG

        If the stuff coming in from your web clients is using the "%XX" notation for utf8 character data, then any "wide" characters (requiring more than one byte in utf8) will require one "%XX" thingie per byte (e.g. a utf8 "ÿ" (U+00FF) would be "%C3%BF").

        If you see that in your input, then pack("C",...) is the right thing as the first step: it creates the appropriate byte sequence for the intended utf8 character. The utf8::decode() step then gets perl to acknowledge that the given byte sequence should be treated as utf8 character data.
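        Put together, a tiny sketch of the whole round trip (the sample string here is hypothetical) looks like this:

        my $value = "na%C3%AFve";                              # e.g. "naïve" as %XX-escaped utf8
        $value =~ s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;   # %XX escapes -> raw utf8 bytes
        utf8::decode($value);                                  # tell perl those bytes are utf8 characters
        binmode STDOUT, ":utf8";
        print length($value), " characters: $value\n";         # 5 characters, not 6 bytes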