in reply to Re: Diacritic-Insensitive and Case-Insensitve Sorting
in thread Diacritic-Insensitive and Case-Insensitve Sorting

But tr/// is faster than s///, and not all characters with diacritics are equivalent. For example, simply substituting "A~" with "," and "E`" with "," means that "Sera'" and "Sere'" would both turn into "Ser," and be sorted as equals. That's probably not the desired behavior.

In other words, simply turning anything "ugly" into a comma will lose important information needed for an accurate sort.

Remember that tr/ABC/A/ turns A, B, or C into A. It's the same as tr/ABC/AAA/. So in my example, I'm turning all of the "e"-like characters with diacritic symbols into an 'e'. I'm also retaining the original string so that in cases where Avo^ and Avo' would be sorted as equals, a defined order is retained.
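
A minimal sketch of that idea, assuming the strings are already decoded Perl character strings and the source file is saved as UTF-8. The helper name fold_key and the sample words are made up for illustration, and the character lists cover only a handful of Latin vowels:

```perl
use strict;
use warnings;
use utf8;                                # this source file contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# fold_key is a hypothetical helper, not from the original node.
# It lowercases the string, then maps accented vowels to the base letter.
# A short replacement list repeats its last character, so
# tr/àáâãäå/a/ behaves like tr/àáâãäå/aaaaaa/.
sub fold_key {
    my ($s) = @_;
    my $key = lc $s;
    $key =~ tr/àáâãäå/a/;
    $key =~ tr/èéêë/e/;
    $key =~ tr/ìíîï/i/;
    $key =~ tr/òóôõö/o/;
    $key =~ tr/ùúûü/u/;
    return $key;
}

my @words = ('Serè', 'Avô', 'Serà', 'Avó');

# Compare folded keys first; fall back to the original strings so that
# words with identical keys (Avó, Avô) still come out in a defined order.
my @sorted = sort { fold_key($a) cmp fold_key($b) or $a cmp $b } @words;

print "$_\n" for @sorted;
```

Serà and Serè get different keys (sera, sere), so they sort as different words. Avó and Avô share the key avo, and the fallback comparison on the original strings is what keeps their relative order defined.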


Dave


Replies are listed 'Best First'.
Re: Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting
by BUU (Prior) on Jan 05, 2004 at 05:16 UTC
    I think the biggest problem is that "E-like characters" is a human-defined notion; there is no technological solution to determine what is "E-like" and what is not. Therefore you basically always have to specify your transformation list by hand.
Re: Re: Re: Diacritic-Insensitive and Case-Insensitve Sorting
by jweed (Chaplain) on Jan 05, 2004 at 05:21 UTC
    From the original node:
    For example, characters e E �© �? �ª �? �¨ should be considered equivalent
    So actually, my behavior is probably...desired behavior


    Who is Kayser Söze?
    Code is (almost) always untested.
      I'm looking at the original post now, and he's showing only that all the variants of 'e' should be considered equivalent. I don't know where you got the idea that all the variants of 'a' are also equivalent to all the variants of 'e'. He never said that. That would be really bizarre alphabetization. He wants all variants of 'A' to be seen as 'a', and all of the variants of 'E' to be seen as 'e'. When you convert everything ugly to commas, you don't get alphabetization at all; you get something really odd.

      Maybe you're not seeing the characters with the proper encoding in your web browser. Your solution assumes that all variants of A, all variants of E, O, U, etc., will all be substituted with commas. That's definitely NOT what he's asking for, and probably not what he has in mind.


      Dave

        Ah. Encoding problem indeed. Sorry, then. Never mind.


        Who is Kayser Söze?
        Code is (almost) always untested.