It is arguable that the most experienced body when it comes to dealing with the problems of different character sets, diacriticals and their influence on collation ordering is the European Union.

Here are the official EU collation sequences for its member countries:

As best I can tell, your invented ordering would be incorrect for every country except possibly the UK. In particular, note how Denmark, Sweden and Finland order those characters with diacriticals at the end.

Over thirty years ago (or more; it's not clear), realising that there was no way to reconcile the disparate expectations of all the member countries, the EU took a pragmatic approach to the problem:

Using Accented and Other Special Characters in Searching

The EU Inventories contain data in all Community languages except Greek and many of these languages contain accented characters in their alphabet.

All words containing accented characters are displayed as such in both WinSPIRS and WebSPIRS. For the former, you may need to choose a font other than the default font if it does not support the ISO 8859-1 (Latin alphabet No. 1) character set (known elsewhere in this database compendium as ISO Latin-1) for display/printing. All words containing accented or foreign characters (as well as a to z and A to Z) are converted to their upper case equivalents and then indexed as such. The collating sequence chosen for all indices in all languages is that for ISO Latin-1 except that all terms beginning with a numeric character appear at end. This has been done to provide ease and consistency in a multi-lingual and multi-database (i.e. when two or more databases from different languages are selected for retrieval) environment.

The actual collating sequence or character order in all indices is:

-, ., A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, À, Á, Â, Ã, Ä, Å, Æ, Ç, È, É, Ê, Ë, Ì, Í, Î, Ï, Ñ, Ò, Ó, Ô, Õ, Ö, Ø, Ù, Ú, Û, Ü, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0.
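To make the quoted rule concrete, here is a minimal Python sketch of it: upper-case each term, then rank every character by its position in the sequence listed above. The names `EU_SEQUENCE`, `RANK` and `eu_sort_key` are my own, not anything from the EU documentation. The quoted text does not say where characters outside the list belong, so placing them last is an assumption, and the repeated trailing "0" in the list is left out.

```python
# Collating sequence as quoted above (the repeated trailing "0" is omitted).
EU_SEQUENCE = (
    "-."
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜ"
    "0123456789"
)

# Map each character to its position in the sequence.
RANK = {ch: i for i, ch in enumerate(EU_SEQUENCE)}


def eu_sort_key(term):
    """Upper-case the term, then rank each character by the quoted sequence."""
    # Assumption: characters not in the quoted sequence sort after everything
    # else; the source text does not specify this.
    return [RANK.get(ch, len(EU_SEQUENCE)) for ch in term.upper()]


terms = ["zebra", "école", "2cv", "eagle", "Ångström", "apple"]
print(sorted(terms, key=eu_sort_key))
# → ['apple', 'eagle', 'zebra', 'Ångström', 'école', '2cv']
```

The terms themselves stay ordinary Unicode strings; only the sort key follows the listed order, which puts accented initials after Z and numeric initials at the very end.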


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: best sort
by tchrist (Pilgrim) on Aug 23, 2011 at 13:28 UTC
    Several of those sequences are demonstrably wrong: they claim a national collating sequence that runs counter to the ordering prescribed by monolingual reference books in that language. For example, ninuco would collate before niño in Spain, not after it.

    I refuse to take seriously some Stone Age directive to pretend everything is ISO 8859-1. That has no relevance to Unicode whatsoever. It is stupid and wrong, and indeed offensive. Your lists are useless to the point of risibility. We don't live in a Latin-1 world anymore, and indeed never did.

    What an utter waste of time.

      I refuse to take seriously some Stone Age directive to pretend everything is ISO 8859-1.

      And that is exactly what is wrong with Unicode. It was formulated by American companies, for American companies, in their typical "We'll right the world's wrongs and they'll see the superiority of our edicts" mentality. If you doubt this, look up Han unification.

      Well, guess what! Whilst you're making shit up and trying to foist your woefully incomplete "unified collation ordering" upon the world (you've still to answer how you're going to work Cyrillic, Indus, Han and all the other non-Roman scripts into your magical world), some of us have been getting on with the pragmatic process of getting things to work in the real world. For 30 years or more.

      Your lists are useless to the point of risibility. We don't live in a Latin-1 world anymore, and indeed never did.

      What an utter waste of time.

      Those lists and quotes are taken directly from the EU's website. Live, current, mandated by EU law and working across 27 countries, 500 million people, and 25% of the global economy.

      And you're missing the point. The text doesn't have to be restricted to or stored as Latin-1. It is just collated and indexed that way.

      Jump up and down all you like; we'll see whether your ideas still persist, say, 30 years from now. Like I said waaay back there, history will tell.
