Re: Sorting according to locale collation

by betterworld (Curate)
on Apr 22, 2007 at 12:00 UTC ( [id://611356]=note )


in reply to Sorting according to locale collation

According to Lithuanian rules this should have printed:
ia
ya
ib
yb
ic
yc

Hmm, I don't know why perl does not sort these correctly. But just out of curiosity: You said that "i" and "y" are treated the same. Would it still be right if you swap "ia" and "ya" in that list?

The sort function, when not given a code block, uses the "cmp" operator, which does use the locale according to perlop (provided "use locale" is in effect). Does the Unix utility sort(1) behave correctly?
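
To make that concrete, here is a minimal sketch of a locale-aware sort in Perl. As far as I know, "cmp" only consults LC_COLLATE inside the scope of "use locale", so the sketch turns the pragma on explicitly. The helper name sort_in_locale and the locale name 'lt_LT.UTF-8' are assumptions for illustration; the Lithuanian locale may be spelled differently or be missing on a given system, in which case the helper returns undef.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Hypothetical helper: sort @words under the named collation locale.
# Returns an array reference, or undef if the locale cannot be set.
sub sort_in_locale {
    my ($locale_name, @words) = @_;
    my $previous = setlocale(LC_COLLATE);    # remember the current setting
    defined setlocale(LC_COLLATE, $locale_name) or return;
    my @sorted;
    {
        use locale;                           # cmp now honours LC_COLLATE
        @sorted = sort @words;
    }
    setlocale(LC_COLLATE, $previous);         # restore the old setting
    return \@sorted;
}

my @words = qw(ia ic ib ya yb yc);
for my $name ('C', 'lt_LT.UTF-8') {           # second name is an assumption
    my $sorted = sort_in_locale($name, @words);
    print "$name: ", ($sorted ? "@$sorted" : "(locale not available)"), "\n";
}
```

Comparing the two lines of output shows whether the installed locale data changes the order at all, independently of Perl itself.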

Re^2: Sorting according to locale collation
by amir_e_a (Hermit) on Apr 22, 2007 at 13:40 UTC

    just out of curiosity: You said that "i" and "y" are treated the same. Would it still be right if you swap "ia" and "ya" in that list?

    I'm not Lithuanian - i just studied it a little at university. From what i've seen in dictionaries and grammar books, when the letter following I/Y is the same, I comes before Y.

    Does the Unix utility sort(1) behave correctly?

    I tried running this:

    [root@sugarcube loc]# LC_COLLATE="lt_LT"
    [root@sugarcube loc]# export LC_COLLATE
    [root@sugarcube loc]# locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE=lt_LT
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
    [root@sugarcube loc]# cat ia.txt
    ia
    ic
    ib
    ya
    yb
    yc
    [root@sugarcube loc]# sort ia.txt
    ia
    ib
    ic
    ya
    yb
    yc

    Looks like sort(1) did something, but not what i expected. I am not sure that i changed the locale correctly - i am not a Unix expert. Any help will be appreciated.

      Looks like sort(1) prints the lines in the same order as Perl's sort does. So I guess the problem is that the locale itself does not treat i and y the same. (I don't know if that's possible at all.)

      According to perldoc perllocale, the locale answers the question "which of these letters comes first". I don't think that the answer "neither i nor y comes first, but i comes first if it is the only difference in the whole word" is allowed.
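
For what it's worth, the rule described earlier in the thread ("i" and "y" count as the same letter, but "i" wins when that is the only difference) can be expressed as a two-level comparison without any locale support. The sketch below is only an illustration of that idea, not a claim about what the lt_LT locale data actually does; primary_key and two_level_cmp are made-up names.

```perl
use strict;
use warnings;

# Primary key: fold "y" onto "i" so both letters compare as equal.
sub primary_key {
    my ($word) = @_;
    (my $key = lc $word) =~ tr/y/i/;
    return $key;
}

# Compare the folded keys first; only when they tie, fall back to the
# raw strings, where "i" sorts before "y" by ordinary character order.
sub two_level_cmp {
    my ($left, $right) = @_;
    my $order = primary_key($left) cmp primary_key($right);
    return $order || ($left cmp $right);
}

my @sorted = sort { two_level_cmp($a, $b) } qw(ia ic ib ya yb yc);
print "@sorted\n";    # prints "ia ya ib yb ic yc"
```

With this comparison, adding "ha" and "ja" to the list puts "ha" before all the i/y words and "ja" after them.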

      What is the output if you add, say,

      ha
      ja

      to your test data set?
Re^2: Sorting according to locale collation
by Anonymous Monk on Dec 06, 2011 at 01:30 UTC

    The Lithuanian dictionary must be wrong, as "i" and "y" are not the same. Have they documented the rule, is it self-consistent, and do the dictionary entries match? If the answer to any of these is no, then randomise the listing for .arts sorts.

    What is critical for collation is that any character position is monotonic.

    LC_ALL=C (or at least LC_COLLATE=C) is the only legal value. Any other value is known to break strcoll(). Better to use safer strcmp().

    It should always compare by character numerical value. I.e. either byte value (US-ASCII, ISO-8859) or possibly UTF code point. The byte at a time is simpler and won't break existing applications.

    With EBCDIC 1047 it will never be alphabetical order, but it will be in order and able to be bsearched. UTF-8 byte at a time will also produce odd, but consistent results.
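
As a rough illustration of the byte-at-a-time ordering described here, the following Perl sketch sorts strings by the bytes of their UTF-8 encoding, with "use locale" not in effect. The helper name bytewise_sort and the sample words are assumptions made up for the example. The result is consistent and searchable, but not alphabetical: uppercase sorts before lowercase, and the accented letter sorts after "z".

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: order strings by the byte values of their UTF-8
# encoding. No locale is consulted, so cmp compares raw byte values.
sub bytewise_sort {
    my @words = @_;
    return map  { $_->[1] }                        # recover the original string
           sort { $a->[0] cmp $b->[0] }            # compare encoded bytes
           map  { [ encode('UTF-8', $_), $_ ] } @words;
}

# "\x{E4}" is a-umlaut; its UTF-8 encoding starts with byte 0xC3.
my @sorted = bytewise_sort("zebra", "\x{E4}rger", "apple", "Zulu");
print encode('UTF-8', join(' ', @sorted)), "\n";
```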

    Please get rid of i18n and l10n from at least the curses screen and command line. Other charsets are okay so long as they don't break sort, look, etc. As for GUIs with internal UTF-16 host endian buffers I don't care so long as they read and write UTF-8 to the system.
