"Use of locales with Unicode data may lead to odd results.
Currently, Perl attempts to attach 8-bit locale info to characters
in the range 0..255, but this technique is demonstrably incorrect for
locales that use characters above that range when mapped into
Unicode. Perl's Unicode support will also tend to run slower. Use of
locales with Unicode is discouraged."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I would have thought Unicode::Collate is the correct way to go when sorting UTF-8 encoded data. But, looking at the docs, I couldn't make head or tail of it: it assumes you know an awful lot about "Unicode Technical Standard #10".
If your current locale is, say, es_ES, how do you actually instantiate the correct Unicode::Collate object for that locale?
Looking at the docs (and guessing at the UTS#10), I'd say that the Unicode collation algorithm is locale independent: it is supposed to give a collation key for any Unicode string (the fact that they are encoded in UTF-8 is immaterial, BTW).
So I'd just use Unicode::Collate->new()->sort(@list)
If you want to customize the results, then you'll have to understand the UTS#10, but otherwise it should "just work".
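A minimal sketch of that suggestion, comparing a plain code point sort with the default (untailored) collator. The word list is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal non-ASCII characters
use Unicode::Collate;

# Hypothetical sample data, chosen to mix case and accents.
my @words = ('zèbre', 'Éclair', 'abri', 'éclat', 'Zoo', 'école');

# Plain sort compares code points: uppercase ASCII comes before
# lowercase ASCII, and accented letters come after "z".
my @by_codepoint = sort @words;

# Default collator, no tailoring: accents and case only break ties
# between words whose base letters are otherwise identical.
my $collator = Unicode::Collate->new();
my @by_uca   = $collator->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point: @by_codepoint\n";   # Zoo abri zèbre Éclair éclat école
print "collated:   @by_uca\n";         # abri Éclair éclat école zèbre Zoo
```

The same object also offers $collator->cmp($a, $b) for pairwise comparison and $collator->getSortKey($string) if you want to cache the collation keys yourself.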
--
dakkar - Mobilis in mobile
Most of my code is tested...
Perl is strongly typed, it just has very few types (Dan)
Thank you, and also Dakkar, plus all the other kind people who posted,
The solution is indeed to use "Unicode::Collate" together with a file called "allkeys.txt". No locales needed.
Just put "use Unicode::Collate;" (module names are case-sensitive) and add the lines:
my %tailoring;
my $Collator;
$Collator = Unicode::Collate->new(%tailoring);
@char = $Collator->sort(@char);
and sorting works "automagically"; French is now totally correct; for the other character sets such as Greek it looks logical but I'll get our translators to check the order for me just in case there are still some quirks.
However, Swedish and Finnish no longer sort correctly, because in those languages Ä, Ö etc are considered to come after "Z" so it looks like I'm going to have to do an if/else with "normal" sort and "collate". But who cares, I'm a huge step forward from where I was this morning.
Thanks a lot guys,
Anne
Good to hear it's kind of working for you!
But there will be a correct way of handling Swedish and Finnish Unicode collation, so I wouldn't start switching between sort and collate in such cases until you've exhausted the "correct" way(s?) of doing this.
Perhaps you could ask SADAHIRO Tomoyuki, the Unicode::Collate author?
It would certainly be good if you can post any reply you get here, since this is the kind of stuff perl developers are going to have to know more and more about - I'm of course talking about the folks who don't know it already :)
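On the Swedish and Finnish ordering discussed above, here is a hedged sketch of one possible route. It assumes a release of Unicode::Collate recent enough to ship the companion module Unicode::Collate::Locale, which is not mentioned anywhere in this thread, so check that your installation has it. The sample words are invented for illustration.

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal non-ASCII characters
use Unicode::Collate;
use Unicode::Collate::Locale;  # assumption: available in your installation

# Hypothetical sample data.
my @words = ('Zorn', 'Ärlig', 'Adam', 'Öst', 'Orm');

# Untailored collator: Ä sorts with A and Ö sorts with O.
my $default = Unicode::Collate->new();

# Swedish tailoring: Ä and Ö are separate letters placed after Z.
my $swedish = Unicode::Collate::Locale->new(locale => 'sv');

my @default_order = $default->sort(@words);
my @swedish_order = $swedish->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @default_order\n";   # Adam Ärlig Orm Öst Zorn
print "swedish: @swedish_order\n";   # Adam Orm Zorn Ärlig Öst

# getlocale reports which tailoring was actually selected.
print "tailoring in use: ", $swedish->getlocale, "\n";
```

The same constructor should also cover the earlier es_ES question: pass the locale name as the locale argument and check getlocale to see which tailoring you got. That keeps a single code path instead of an if/else between sort and collate.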
Hi,
Thanks for your reply. I thought using locales might work, although I have never used them before. But when I ran
use locale;
print +(sort grep /\w/, map { chr() } 0..255), "\n";
to find out exactly what kind of ordering I would get, my was it weird. These are just the first few characters:
_01╣2▓3│456789aAß┴Ó└Ô┬õ─Ò├Õ┼µãbBcCþÃdDðeE
Nonetheless, I tried the sort with
#use locale;
@char = sort(@char);
#no locale;
and I got the ordering
A É B C D E
for the French. Not quite what I expected.
Now I'm going to take a look at "Unicode::Collate" as mentioned by another post.
Thanks for your help though. It is definitely a steep learning curve.
Anne