in reply to Sorting utf-8

perldoc -f sort ... If SUBNAME or BLOCK is omitted, "sort"s in standard string comparison order.

perldoc perlop ... Equality Operators ...

Binary ``cmp'' returns -1, 0, or 1 depending on whether the left argument is stringwise less than, equal to, or greater than the right argument.

``lt'', ``le'', ``ge'', ``gt'' and ``cmp'' use the collation (sort) order specified by the current locale if use locale is in effect. See perllocale.

perldoc perllocale ...
SYNOPSIS
    @x = sort @y;      # ASCII sorting order
    {
        use locale;
        @x = sort @y;  # Locale-defined sorting order
    }
    @x = sort @y;      # ASCII sorting order again


Re: Re: Sorting utf-8
by Anonymous Monk on Apr 24, 2003 at 11:03 UTC
    >current locale if use locale is in effect. See perllocale.

    But locales and unicode don't mix well:

    perldoc perlunicode:

    "Use of locales with Unicode data may lead to odd results.
    Currently, Perl attempts to attach 8-bit locale info to characters
    in the range 0..255, but this technique is demonstrably incorrect for
    locales that use characters above that range when mapped into
    Unicode.  Perl's Unicode support will also tend to run slower.  Use of
    locales with Unicode is discouraged."
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    I would have thought Unicode::Collate is the correct way to go when sorting utf8-encoded data. But, looking at the docs, I couldn't make head or tail of it: it assumes you know an awful lot about "Unicode Technical Standard #10".

    If your current locale is, say, es_ES, how do you actually instantiate the correct Unicode::Collate object for that locale?

      Looking at the docs (and guessing at the UTS#10), I'd say that the Unicode collation algorithm is locale independent: it is supposed to give a collation key for any Unicode string (the fact that they are encoded in utf-8 is immaterial, BTW)

      So I'd just use Unicode::Collate->new()->sort(@list)

      If you want to customize the results, then you'll have to understand the UTS#10, but otherwise it should "just work"
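A minimal sketch of that one-liner next to a plain sort, assuming Unicode::Collate is installed along with its default allkeys.txt table. The word list is made up for illustration, and the accented letters are written as escapes so the source stays ASCII.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample data; \x{C9} is "É" and \x{E9} is "é".
my @words = ("zoo", "\x{E9}tude", "apple", "\x{C9}mile", "banana");

# Plain sort compares code points, so the accented words land after "zoo".
my @by_codepoint = sort @words;

# Default collator: no tailoring, just the stock collation table.
my @by_uca = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point: @by_codepoint\n";   # apple banana zoo Émile étude
print "collated:   @by_uca\n";         # apple banana Émile étude zoo
```

With the default table, "É" and "é" are ordered alongside "e", which is the behaviour one would want for French.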

      -- 
              dakkar - Mobilis in mobile
      


        Hmm, pretty cool if that's how it works - from looking at the constructor arguments I was wondering if you'd need a hash defining per-locale options, but maybe not.

        I would pay a LOT of money for a book that explained Unicode in a non-geeky manner :)

      Thank you, and also Dakkar, plus all the other kind people who posted,

      The solution is indeed to use "Unicode::Collate" together with a file called "allkeys.txt". No locales needed.

      Just put "use Unicode::Collate;" (module names are case-sensitive) and add the lines:
          my %tailoring;
          my $Collator = Unicode::Collate->new(%tailoring);
          @char = $Collator->sort(@char);

      and sorting works "automagically"; French is now totally correct; for the other character sets such as Greek it looks logical but I'll get our translators to check the order for me just in case there are still some quirks.

      However, Swedish and Finnish no longer sort correctly, because in those languages Ä, Ö etc are considered to come after "Z" so it looks like I'm going to have to do an if/else with "normal" sort and "collate". But who cares, I'm a huge step forward from where I was this morning.

      Thanks a lot guys, Anne
        Good to hear it's kind of working for you!

        But there will be a correct way of handling Swedish and Finnish Unicode collation, so I wouldn't start switching between sort and collate in such cases until you've exhausted the "correct" way(s?) of doing this.

        Perhaps you could ask SADAHIRO Tomoyuki, the Unicode::Collate author?

        It would certainly be good if you can post any reply you get here, since this is the kind of stuff perl developers are going to have to know more and more about - I'm of course talking about the folks who don't know it already :)
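One candidate for that "correct" way, sketched under an assumption: releases of Unicode::Collate newer than the ones current in this thread ship a companion module, Unicode::Collate::Locale, with ready-made tailorings selected by a locale name such as 'sv'. The names below are made up for illustration, and whether this module is available depends on the installed version.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;   # assumption: a release that includes this module

# Hypothetical sample data; \x{C4} is "Ä" and \x{D6} is "Ö".
my @names = ("Zorn", "\x{C4}rla", "Adam", "\x{D6}rn");

# Untailored collation keeps "Ä" with "A" and "Ö" with "O".
my @default = Unicode::Collate->new->sort(@names);

# The Swedish tailoring moves "Ä" and "Ö" after "Z".
my @swedish = Unicode::Collate::Locale->new(locale => 'sv')->sort(@names);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @default\n";   # Adam Ärla Örn Zorn
print "swedish: @swedish\n";   # Adam Zorn Ärla Örn
```

Picking the collator by language keeps a single code path, rather than an if/else between sort and collate.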

Re: Re: Sorting utf-8
by webelan (Acolyte) on Apr 24, 2003 at 12:53 UTC
    Hi, Thanks for your reply. I thought using locales might work, although I have never used them before. But when I ran
    use locale; print +(sort grep /\w/, map { chr() } 0..255), "\n";

    to find out exactly what kind of ordering I would get, my, was it weird. These are just the first few characters:

    _01╣2▓3│456789aAß┴Ó└Ô┬õ─Ò├Õ┼µãbBcCþÃdD­ðeE

    Nonetheless, I tried the sort with
    #use locale; @char = sort(@char); #no locale;

    and I got the ordering

    A É B C D E

    for the French. Not quite what I expected. Now I'm going to take a look at "Unicode::Collate", as mentioned in another post.

    Thanks for your help though. It is definitely a steep learning curve.
    Anne