I almost never deal with people’s names; I think doing so is pretty rare.
However, except for the Mc/Mac thing, whose solution I didn’t actually include because it requires more explanation than I cared to go into then (you need to supply a preprocess transform to the constructor), all the rest is pretty standard text stuff.
Almost the only time I sort stuff is for output. Otherwise order is probably immaterial. And if I’m sorting for output, I want to be able to look at text in the order that English has used for hundreds of years to sort its text. I do not want to look at things in random code order; down that road lies madness.
You can look at the the default UCA sort as a very good approximation to historical order of text. No, it isn’t perfect, but everyone is equally disadvantaged. For example, it doesn’t do the English Mc/Mac trick unassisted.
What is does do, though, seems to be the logical thing for text in a way that code point order never is. Explaining it makes it seem harder than it is, but it isn’t hard; it’s what makes sense for text, and it’s how we’ve done it forever. This is a little bit of a simplification, but it works essentially this way:
- 1. Primary strength
- Compare to see whether the basic letters are the same. Ignore nonletters at this stage; just skip ahead till you find a letter. If the letters aren’t the same for the same relative position, there is an established dictionary order about what goes first. If you are a user of the Latin alphabet, this will be in the order of the abc’s you learned in school, so Fred comes before freedom, as does free beer. The reason it put free beer in front of freedom is because the fifth letter in the first string is b, and that comes before the fifth letter in the second string, which is d. See how that works? That’s dictionary order. We aren’t doing a field sort here.
- 2. Secondary strength
- If the letters are the same, then check whether the diacritics are the same. By default we resolve ties by looking at the diacritics reading left to right, but this can be flipped to do so right to left to keep the French less unhappy. (The classic demo is that normal LTR tie‐breaking order sorts cote < coté < côte < côté, whereas the French RTL tie‐breaking order for diacritics sorts cote < côte < coté < côté. Yes, really, and I’m sorry, but it’s truly not my fault. It has to do with their inflectional morphology, which is tail‐based.)
- 3. Tertiary strength
- If the letters and the diacritics are the same, then check whether the case is the same. By default, lowercase precedes uppercase, but this is easy to flip.
- 4. Quaternary strength
- If the letters, the diacritics, and the case are all the same for a given position, now you may go back and reconsider any nonletters, like punctuation and symbols and whitespace.
You don’t have to do all those if you don’t want. You can for example tell it to use only the primary strength, which only considers basic letters and absolutely nothing else, even case. (That’s how you do an “accent‐insensitive” string comparison, BTW, using your collator object’s eq method.) If you wanted it to ignore case but consider accents for level one ties, just set it to do only the first two stages and skip the rest.
You can even define more levels that the standard four if you want, and all kinds of things can be tweaked, but the default is usually good enough.
This works how pretty well for English text, and it still works even if it has non‐Latin mixed in with it here and there, since Greek letters would sort within their own alphabet, &c. Strings like Joe Bob,
Joe‐Bob,
Joebob, and
Jo E. Bob are all the same at the first two levels, and only diverge after that, with case differences at the third level and nonletters at the fourth. Yes, this really does make it easier to read lists of text. I have processed thousands and thousands of output reports of English text this way, and it really does.
If you don’t like how it by default ignores nonletters (which is a bit misleading of me to put it that way, since numbers come before letters and such; it isn’t as stupid as I may have made it sound) at the primary comparison stage, that is tweakable.
I guess one way to think of the default UCA sort in Perl pseudocode would be:
primary($a) <=> primary($b)
||
secondary($a) <=> secondary($b)
||
tertiary($a) <=> tertiary($b)
||
quaternary($a) <=> quaternary($b)
Like the code above, the multilevel sort short circuits when it finds a difference between its operands at the level. You can set it to do only N levels if you want.
That presumes that those functions are defined to return the right sort of magic number. It doesn’t account for how the default ignorables get skipped unless you manage to fall through to the fourth level. Remember that letter positions aren’t the same as equivalent indices of arrays.
I bet I’ve managed to convince everyone that nobody needs anything as complicated as this and that you should all go back to garbage ordering. That certainly isn’t my intent. If you just try the collator object’s default sort on text the next time you have to sort some text, without any tweaking or anything, I think you will be aesthetically pleased with the results. More pleased, in fact, than you are(n’t) by the code point garbage order you get with the unmodified builtin sort.
In other words: “Try it, you’ll like it!”
- DISCLAIMER:
-
As always, your mileage may vary — except of course in Europe where this statement is illegal, in which case you didn’t read it anyway so no harm no foul.
| [reply] [d/l] |
I thought about it a little more, and decided to revise my answer a bit so I could provide three scenarios where a bare sort might be called for, and also to give a shallower ramp‐up before plunging right into it all. | [reply] |
I think that the changes are an improvement. But only a slight one.
Essentially, I think by overstating the case, you are diluting its impact.
You use emotive terms: "garbage ordering". You make it sound as if you believe that 99% of all uses of sort should use your module and pour scorn on the idea that anyone might not. All this despite that people have been successfully -- enough for virtually all requirements -- been sorting data, including text, for 30 years without it.
For that, (IMO) very small -- perhaps 1% or 2% -- of cases where people might actually be offended by the relative ordering of accented characters, your module may be valuable. The rest of the time it is barely more than an expensive nice-to-have.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Then ignore the matter of letters with diacritics for now, since you seem either not to use them or else not to care if they change something from being the same letter and it send scurrying off to a completely different segment of your output.
Instead, just look at a classic “dictionary sort”, where you fold/ignore case and ignore everything but alphanumerics. That is the way phonebooks and card catalogues have historically been ordered in English, since way back before computers even existed. It is useful.
You can kind of get that running the shell sort -dfu
command, but that program is so obsessively oriented toward whitespace separated fields that you need to trick it into using a nonexistent separator, like here using Control-C:
$ perl -nle 's/^\s*#\s*define\s+// and print' perl/*.h | sort -t^C -df
(output excerpts inside the <readmore>)
See how useful that is? That’s why we’ve done it that way for hundreds of years, because it helps. You certainly don’t need Unicode to demonstrate this principle, as it applies to any text no matter the character repertoire.
I should probably have been clearer about this, because it is an important point:
In these days of bizarre branding using eyecatching typography to get your attention like “CamelCase” and “StUdLyCaPs”, free variation in spacing and hyphenation like “Post-it® Notes” and “postit notes”, and trademarks with non‐ASCII in them like “Häagen‐Dazs®” with its meaningless diacritic or “Encyclopædia Britannica” with its old‐school ligature, it is perhaps more important today than ever before that we have easy access to collation algorithms able to treat “FOOBAR”, “__foobar__”, “Foo Bar”, “foo-bar”, and “ƒöøbɐƦ” as minor variations of the same underlying sequence of six basic letters.
It is ultra‐useful to be able to think of things that ignore things like diacritics, casing, and non‐alphanumerics. There is nothing new here, since this is has historically been one ordering option frequently used by lexicographers, and it remains completely useful today. With more characters in our repertoires that ever before, it’s much harder to group them the way manual sorters have always done it in the past. But we still want to do so.
Those are some of the appeals of the Unicode::Collate module for sorting text. I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.
| [reply] [d/l] [select] |
| [reply] |