But tr/// is faster than s///. Also, not all characters with diacritic symbols are equal. For example, simply substituting "A~" with "," and "E`" with "," means that "Sera'" and "Sere'" would both turn into "Ser," and be sorted as equals. That's probably not the desired behavior.
In other words, simply turning anything "ugly" into a comma will lose important information needed for an accurate sort.
Remember that tr/ABC/A/ turns A, B, or C into A. It's the same as tr/ABC/AAA/. So in my example, I'm turning all of the "e"-like characters with diacritic symbols into an 'e'. I'm also retaining the original string so that in cases where Avo^ and Avo' would be sorted as equals, a defined order is retained.
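Here is a minimal sketch of that idea, not a definitive implementation. The names fold_accents and sort_folded are made up for illustration, and the character lists are hand-picked assumptions, not a complete table; you would extend them for whatever your data actually contains. It folds each accented letter to its base letter with tr///, sorts on the folded key, and falls back to the original string when two keys are equal.

```perl
use strict;
use warnings;
use utf8;   # this source contains literal accented characters

# Fold a hand-picked set of accented letters to their base letter.
# tr/LIST/a/ maps every character in LIST to 'a' (the last replacement
# character is repeated). The lists are illustrative assumptions only.
sub fold_accents {
    my ($s) = @_;
    $s =~ tr/àáâãäå/a/;
    $s =~ tr/èéêë/e/;
    $s =~ tr/òóôõö/o/;
    return $s;
}

# Sort on the folded key first, then on the original string, so words
# that fold to the same key still come out in a defined order.
sub sort_folded {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ fold_accents($_), $_ ] } @_;
}

my @sorted = sort_folded('Serè', 'Sera', 'Serà', 'Sere', 'Serb');
# @sorted is now: Sera, Serà, Serb, Sere, Serè
```

Note that the 'a' variants stay with the 'a' words and the 'e' variants stay with the 'e' words, which is exactly what gets lost if everything is turned into a comma.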
| [reply] |
I think the biggest problem is that "E-like characters" is a human definition; there is no technological solution for determining what is "E-like" and what is not. Therefore you basically always have to specify your transformation list by hand.
| [reply] |
I'm looking at the original post now, and he's showing only that all the variants of 'e' should be considered equivalent. I don't know where you got the idea that all the variants of 'a' are also equivalent to all the variants of 'e'. He never said that. That would be really bizarre alphabetization. He wants all variants of 'A' to be seen as 'a', and all variants of 'E' to be seen as 'e'. When you convert everything ugly to commas, you don't get alphabetization at all; you get something really odd.
Maybe your web browser isn't showing the characters with the proper encoding. Your solution assumes that all variants of A, E, O, U, etc. will be substituted with commas. That's definitely NOT what he's asking for, and probably not what he has in mind.
| [reply] |