Re: Stuck in accent-land

Perl does the right thing with data marked as utf8. The problem arises with data that is not marked as utf8. In that case, use locale may be the answer: it makes Perl decide which characters count as word characters (that is, what \w matches) based on your locale environment variables.

If you are stuck on a platform whose C library has only minimal locale support (such as cygwin), you need to upgrade the data to utf8 instead, either by appending and then removing a wide character or by calling utf8::upgrade. Excerpt from the utf8::upgrade pod:

* $num_octets = utf8::upgrade($string) Converts (in-place) internal representation of string to Perl's internal UTF-X form. Returns the number of octets necessary to represent the string as UTF-X. Can be used to make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as expected on strings containing characters in the range 0x80-0xFF (on ASCII and derivatives). Note that this should not be used to convert a legacy byte encoding to Unicode: use Encode for that. Affected by the encoding pragma.
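
Here is a minimal sketch of both upgrade routes, in case it helps. It assumes the data is Latin-1 (if it is in some other legacy encoding, decode it with Encode instead, as the pod says), that no locale or unicode_strings feature is in effect, and the variable names are just mine for illustration:

```perl
use strict;
use warnings;

# A byte string holding one Latin-1 character, e-acute (0xE9).
# Assumption: the data really is Latin-1; the UTF-8 flag is off here.
my $str = "\xE9";
my $flag_before = utf8::is_utf8($str) ? 1 : 0;

# Route 1: upgrade in place. The return value is the number of
# octets needed to hold the string in Perl's internal UTF-X form.
my $octets = utf8::upgrade($str);
my $flag_after = utf8::is_utf8($str) ? 1 : 0;

# With the flag on, \w and uc() treat the character as a letter.
my $is_word   = $str =~ /^\w$/ ? 1 : 0;
my $upper_ord = ord(uc $str);

# Route 2: append a wide character (above 0xFF) and remove it again.
# The append forces the upgrade; chop does not undo it.
my $other = "\xE9";
$other .= "\x{100}";
chop $other;
my $other_flag = utf8::is_utf8($other) ? 1 : 0;

printf "flag before=%d after=%d octets=%d word=%d uc=0x%X route2 flag=%d\n",
    $flag_before, $flag_after, $octets, $is_word, $upper_ord, $other_flag;
```

Either way the string still has length 1 and compares equal to the original; only the internal representation changes.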
