utf8 change case on accented characters

JimmyMTL has asked for the wisdom of the Perl Monks concerning the following question:

I have lots of names in a UTF-8 text file.
When I try to lowercase or uppercase them, any accented characters remain in their original case.

The following translation works fine for nuking a string (lowercase):

tr/AÁÀÂÅÄÃBCÇDEÉÈÊËFGHIÍÌÎÏJKLMNÑOÓÒÔÖÕPQRSTUÚÙÛÜVWXYŸZ/aáàâåäãbcçdeéèêëfghiíìîïjklmnñoóòôöõpqrstuúùûüvwxyÿz/;

but it doesn't allow me to make cool use of things like \u for capitalizing words in a substitution.

I'm running Perl v5.8.9 built for darwin-2level on Mac OS X Leopard (standard distro).

I've got use UTF8;

My setlocale refuses to work - error message of "Undefined subroutine &main::setlocale called"

My system locales are all variations on LC_CTYPE="en_US.UTF-8", which may be hindering my adventure (the names are French).

I'm sure I'm not the first person to experience this behaviour - but a lot of googling has led to nothing but others with success by adding "use utf8;" (which I already had).

Advice? Ideas?

I don't want to have to iterate over every character in the string manually. The tr above is not elegant, but it works.

Thanks for any assistance you can provide!


Re: utf8 change case on accented characters by ikegami (Patriarch) on Sep 09, 2009 at 17:00 UTC
As answered in the CB, use `utf8::upgrade` on the string.

The behaviour of some Perl ops currently depends on the internal encoding of the string. `utf8::upgrade` and `utf8::downgrade` alter the internal encoding of the string. `\u` and `\l` are implemented in terms of `ucfirst` and `lcfirst`, which, like `uc` and `lc`, are susceptible to this limitation/bug. For example,

`$ perl -le'use open ":std", ":locale"; $_="\xE0 la plage"; utf8::downgrade($_); print "\u$_"'`
`à la plage`
`$ perl -le'use open ":std", ":locale"; $_="\xE0 la plage"; utf8::upgrade($_); print "\u$_"'`
`À la plage`

"I've got use UTF8;"

I hope you mean `use utf8;`, which simply tells Perl that the source code containing it is encoded using UTF-8 (not iso-latin-1). It doesn't affect IO.

"I have lots of name in a UTF8 text file."

Did you decode the contents back into characters? One way:

`open(my $fh, '<:encoding(UTF-8)', $qfn) or die("Can't open file $qfn: $!\n");`

Don't forget to encode on the way out.

"Undefined subroutine &main::setlocale called"

`setlocale` is from POSIX. Did you actually load the POSIX module and import `setlocale` from it?

Update: Added example.
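
For future readers, here is a minimal sketch of the round trip described above: decode on the way in, upgrade, change case, encode on the way out. The helper name `capitalise_utf8_bytes` and the sample string are made up for illustration and are not from the thread. It uses the core Encode module on an in-memory byte string rather than a file handle.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): takes raw UTF-8 bytes and
# returns raw UTF-8 bytes with the first character capitalised.
sub capitalise_utf8_bytes {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> characters
    utf8::upgrade($chars);                 # ensure the upgraded internal format
    return encode('UTF-8', "\u$chars");    # characters -> bytes on the way out
}

my $in  = "\xC3\xA0 la plage";             # UTF-8 bytes for "à la plage"
my $out = capitalise_utf8_bytes($in);
print "$out\n";                            # raw bytes, no output layer needed
```

When reading from a file, the `:encoding(UTF-8)` layer shown in the reply above does the decoding step for you, so only the case change remains.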
Re^2: utf8 change case on accented characters by JimmyMTL (Initiate) on Sep 09, 2009 at 17:07 UTC
Thanks, ikegami. I will do the utf8::upgrade and downgrade thing and see where that puts me.

Yes, I do mean use utf8; although my code file has no BOM, it still seems to work.

I'm so used to using setlocale on perl on our linux servers that I never even thought about why setlocale was available. Importing the module is always a good thing when it's not there by default.

Again, many thanks, and I'll report the results with code samples for the benefit of future googlers and perl monastery residents alike...
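
A small sketch of the import that was missing, assuming only the core POSIX module. `setlocale` and the `LC_CTYPE` constant have to be requested from POSIX explicitly; the empty-string argument asks for whatever locale the environment specifies.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);   # without this import, setlocale is undefined in main

# With one argument, setlocale only queries the current setting.
my $before = setlocale(LC_CTYPE);

# An empty string means "take the locale from the environment" (LANG, LC_ALL, ...).
my $result = setlocale(LC_CTYPE, '');

print defined $result
    ? "LC_CTYPE is now $result\n"
    : "could not set LC_CTYPE from the environment\n";
```

Checking the return value matters, since `setlocale` returns undef when the requested locale is not available on the system.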
Re^3: utf8 change case on accented characters by ikegami (Patriarch) on Sep 09, 2009 at 17:35 UTC
"Yes, I do mean use utf8; although my code file has no BOM, it still seems to work."

Byte order is immutable with UTF-8, so the BOM is useless as a BOM with UTF-8. Some applications use it as a signal that the file is encoded using UTF-8, but Perl uses `use utf8;` for that.

"I'll report the results with code samples"

By the way, I added an example to my earlier post. If you're having problems, please use Devel::Peek and provide us a `Dump` of the string that's giving you problems.
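
For anyone unsure what such a `Dump` looks like, here is a minimal sketch using the core Devel::Peek module. The sample string is the one from the earlier one-liners; the variable name is arbitrary.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $s = "\xE0 la plage";

utf8::downgrade($s);
Dump($s);    # written to STDERR; FLAGS should not list UTF8 here

utf8::upgrade($s);
Dump($s);    # same characters, but FLAGS should now include UTF8

# utf8::is_utf8 reports the same internal flag without the full dump.
print "upgraded: ", (utf8::is_utf8($s) ? "yes" : "no"), "\n";
```

The thing to look for in the output is whether UTF8 appears in the FLAGS line, since that is the internal state the case-changing ops depend on.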