Re: Diacritic-Insensitive and Case-Insensitive Sorting

I dug deeper into the PODs and found that a variant of my original solution is actually discussed there. See the SORTING section of perlebcdic; it is very educational.

From the POD:

MONO CASE then sort data. In order to minimize the expense of mono casing mixed test try to tr/// towards the character set case most employed within the data. If the data are primarily UPPERCASE non Latin 1 then apply tr/[a-z]/[A-Z]/ then sort(). If the data are primarily lowercase non Latin 1 then apply tr/[A-Z]/[a-z]/ before sorting. If the data are primarily UPPERCASE and include Latin-1 characters then apply:

tr/[a-z]/[A-Z]/;
tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
s/ß/SS/g;

then sort(). Do note however that such Latin-1 manipulation does not address the ÿ y WITH DIAERESIS character that will remain at code point 255 on ASCII machines, but 223 on most EBCDIC machines where it will sort to a place less than the EBCDIC numerals. With a Unicode enabled Perl you might try:

tr/^?/\x{178}/;

The strategy of mono casing data before sorting does not preserve the case of the data and may not be acceptable for that reason.
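
To make that last caveat concrete, here is a minimal sketch of "mono case, then sort". It assumes a Unicode-enabled Perl and uses lc() instead of the POD's tr///; the word list is made up for illustration.

```perl
use strict;
use warnings;
use utf8;   # the string literals below contain non-ASCII characters

# Made-up sample data, not from the original thread.
my @words = ('Zebra', 'apple', 'Éclair', 'banana');

# Mono-case every string, then sort the lowercased copies.
my @lowered = map { lc($_) } @words;
my @mono    = sort @lowered;

# @mono now holds only lowercased strings ('Zebra' became 'zebra'),
# and 'éclair' still sorts after 'zebra' because the accent is untouched.
```

The original capitalization is gone from the result, which is exactly the limitation the POD warns about.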

The POD notes that transliterating before the sort obliterates the original string's case (and, in our variant, its diacritic symbols as well). That is why I also used a Schwartzian Transform in my strategy: the sort runs on a normalized copy, and the original strings come back untouched. However, as Roger mentioned in the CB, that method is memory-expensive.
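
For reference, a rough sketch of that idea follows. It is not my original code: sort_key() is a hypothetical helper, its tr/// table covers only a handful of lowercase Latin-1 letters, and the sample words are made up.

```perl
use strict;
use warnings;
use utf8;   # the source contains non-ASCII literals

# Hypothetical helper: build a case- and diacritic-insensitive sort key.
# The tr/// table is illustrative and far from complete.
sub sort_key {
    my ($s) = @_;
    my $key = lc $s;
    $key =~ tr/àáâãäåçèéêëìíîïñòóôõöùúûüýÿ/aaaaaaceeeeiiiinooooouuuuyy/;
    return $key;
}

# Made-up sample data.
my @words = ('Zebra', 'Éclair', 'apple', 'école', 'Banana');

# Schwartzian Transform: pair each string with its key, sort on the key,
# then drop the key so the original strings come out unmodified.
my @sorted =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map  { [ $_, sort_key($_) ] }
    @words;

# @sorted is ('apple', 'Banana', 'Éclair', 'école', 'Zebra')
```

The memory cost comes from the intermediate list: every element gets its own anonymous array holding both the original string and its key for the duration of the sort.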

Of course, what we're doing goes beyond normalizing case: we're also normalizing diacritic symbols. But the idea is similar.

Dave
