in reply to unicode [A-F] equivalent?

But this seems like a common problem,
Really? I don't think the problem is common at all. Texts that mix words from multiple scripts are uncommon, and when they do occur, the foreign-script material is typically a single word or short phrase, and certainly not indexed.

I don't think there's a canned solution that works for everyone. For instance, Chinese doesn't have the notion of "alphabetical" ordering of words, at least not in the way we are used to in the Western world. If you have a Chinese friend, ask him/her to explain how a Chinese dictionary works. I once did, and that was a learning experience. Your suggested solution will probably work if you have a handful of non-Western words, but does it scale if 70% of your list consists of Chinese and Korean words?

Re^2: unicode [A-F] equivalent?
by qq (Hermit) on Mar 22, 2005 at 14:07 UTC

    It would not scale at all well, I agree. Luckily I'm not creating a multilingual dictionary, but organizing a list of English-language radio show names. Occasionally an accented character will come through, but anything else will be a surprise.

    I did once work on an international Who's Who book. The ordering of names was "solved" by having romanized equivalents. But it was the editor's job to decide the order, not mine.

      Even the accented letters are a problem. Accented letters often come from Western or Northern European countries, which all use the ISO Latin-1 alphabet. But while an accented letter may look the same in different countries, it is not necessarily the same letter. In some languages, an accent just means the letter is pronounced differently, but it's still the same letter. In another language, the same accent turns it into a different letter altogether. And even if you have two languages that use the same accented letter, it doesn't necessarily mean the letters sort the same way.

      Which is why we have locales. And which means that whatever solution you pick, there are people who will be surprised.

      If only we all spoke (and wrote) Egyptian hieroglyphs, we wouldn't have this mess.

        I do agree with you in the general case. But honestly, guv, there are extenuating circumstances here. It's an admin interface with very few users, dealing with content that is basically English. The existing interface groups by English alphabetic ranges and would have silently dropped items that did not fit. I did suggest alternatives to the UI team, but they preferred to keep the existing interface. Regardless...

        I would like to hear better approaches that minimize surprise to users. The problem splits into (at least) two parts: a) sorting a list that may contain non-English words, and b) grouping said list into groups that map somewhat to English character ranges. We can assume that the target audience is English-speaking. Both of these may be basically impossible to do correctly for all cases. So what's the least surprising behaviour?
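
For the two parts above, here is one possible sketch, in Python purely for illustration. It assumes an English-speaking audience, so accents are simply folded away (via Unicode NFD decomposition) before sorting and bucketing, which is exactly the kind of locale-blind shortcut that will surprise some readers. The range boundaries, the "Other" bucket, and the function names `fold`, `bucket_for`, and `group_names` are all made up for the example; nothing in the thread specifies them. The one deliberate choice is that items which fit no range land in "Other" instead of being silently dropped.

```python
import unicodedata

# Hypothetical ranges: the thread only says "English alphabetic ranges".
RANGES = [("A", "F"), ("G", "L"), ("M", "R"), ("S", "Z")]
OTHER = "Other"  # catch-all so that nothing is silently dropped


def fold(text):
    """Return a comparison key: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def bucket_for(name):
    """Pick the range label for a name, or OTHER if its first letter fits none."""
    first = fold(name).lstrip()[:1].upper()
    for lo, hi in RANGES:
        if first and lo <= first <= hi:
            return f"{lo}-{hi}"
    return OTHER


def group_names(names):
    """Sort names by their folded key, then distribute them into the ranges."""
    groups = {f"{lo}-{hi}": [] for lo, hi in RANGES}
    groups[OTHER] = []
    # The original string is a tie-breaker so the order is deterministic.
    for name in sorted(names, key=lambda n: (fold(n), n)):
        groups[bucket_for(name)].append(name)
    return groups
```

Calling `group_names(["Éclair Hour", "zebra talk", "911 Live"])` files the accented name under "A-F" and the numeric one under "Other". Letters that have no decomposition, such as "Ø", also end up in "Other" here; whether that is less surprising than filing them under "O" is a judgment call this sketch does not settle.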