in reply to Re: best sort
in thread best sort

Calling sort without a comparison function is nearly always the wrong thing to do. However, I can think of two exceptions to this.

The first exception is when you have no other goal but to iterate in an order that is merely repeatable, even if that order should happen to completely arbitrary.

Do you have any statistics -- real or made up -- about the percentage of sorts performed by programs in Perl (or any or all programming languages) that do not fit the criteria of your first "exception"?

Because after ~30 years of programming in which my gut feel is that 90% or there about of my programs have included a sort for some purpose or another, I can remember just 2 occasions when language dictionary ordering was a goal. One was an internal phone directory. The other was the 6 million records of West African Examination Council University entrance examination results.

At ~300 people, the first was sufficiently short that whether 'Mac...' came before, after or adjacent to 'Mc...' was irrelevant.

In the latter, which encompasses names from 5 countries: Ghana, Liberia, Nigeria, Sierra Leone, and the Gambia, with all their combined mixes of ethnic, religious, tribal and imperial languages, cultures and backgrounds, meant that it was impossible to get them to agree on a single correct collation. Hence, the reports produced for each country, and even different areas within the same country, all had to be ordered differently according to a custom collation.

That was a good many years ago, but I guarantee that there isn't anything on CPAN that would have handled any single one of the dozen or so collation sequences required.

My personal gut-feel assessment is that for 95% (and probably more) uses of sort in programs, the actual collation ordering used is immaterial. What's yours?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: best sort
by tchrist (Pilgrim) on Aug 15, 2011 at 04:49 UTC
    I almost never deal with people’s names; I think doing so is pretty rare. However, except for the Mc/Mac thing, whose solution I didn’t actually include because it requires more explanation than I cared to go into then (you need to supply a preprocess transform to the constructor), all the rest is pretty standard text stuff.

    Almost the only time I sort stuff is for output. Otherwise order is probably immaterial. And if I’m sorting for output, I want to be able to look at text in the order that English has used for hundreds of years to sort its text. I do not want to look at things in random code order; down that road lies madness.

    You can look at the the default UCA sort as a very good approximation to historical order of text. No, it isn’t perfect, but everyone is equally disadvantaged. For example, it doesn’t do the English Mc/Mac trick unassisted.

    What is does do, though, seems to be the logical thing for text in a way that code point order never is. Explaining it makes it seem harder than it is, but it isn’t hard; it’s what makes sense for text, and it’s how we’ve done it forever.

    This is a little bit of a simplification, but it works essentially this way:

    1. Primary strength
    Compare to see whether the basic letters are the same. Ignore nonletters at this stage; just skip ahead till you find a letter. If the letters aren’t the same for the same relative position, there is an established dictionary order about what goes first. If you are a user of the Latin alphabet, this will be in the order of the abc’s you learned in school, so Fred comes before freedom, as does free beer. The reason it put free beer in front of freedom is because the fifth letter in the first string is b, and that comes before the fifth letter in the second string, which is d. See how that works? That’s dictionary order. We aren’t doing a field sort here.
    2. Secondary strength
    If the letters are the same, then check whether the diacritics are the same. By default we resolve ties by looking at the diacritics reading left to right, but this can be flipped to do so right to left to keep the French less unhappy. (The classic demo is that normal LTR tie‐breaking order sorts cote < coté < côte < côté, whereas the French RTL tie‐breaking order for diacritics sorts cote < côte < coté < côté. Yes, really, and I’m sorry, but it’s truly not my fault. It has to do with their inflectional morphology, which is tail‐based.)
    3. Tertiary strength
    If the letters and the diacritics are the same, then check whether the case is the same. By default, lowercase precedes uppercase, but this is easy to flip.
    4. Quaternary strength
    If the letters, the diacritics, and the case are all the same for a given position, now you may go back and reconsider any nonletters, like punctuation and symbols and whitespace.

    You don’t have to do all those if you don’t want. You can for example tell it to use only the primary strength, which only considers basic letters and absolutely nothing else, even case. (That’s how you do an “accent‐insensitive” string comparison, BTW, using your collator object’s eq method.) If you wanted it to ignore case but consider accents for level one ties, just set it to do only the first two stages and skip the rest.

    You can even define more levels that the standard four if you want, and all kinds of things can be tweaked, but the default is usually good enough. This works how pretty well for English text, and it still works even if it has non‐Latin mixed in with it here and there, since Greek letters would sort within their own alphabet, &c.

    Strings like Joe Bob, Joe‐Bob, Joebob, and Jo E. Bob are all the same at the first two levels, and only diverge after that, with case differences at the third level and nonletters at the fourth. Yes, this really does make it easier to read lists of text. I have processed thousands and thousands of output reports of English text this way, and it really does.

    If you don’t like how it by default ignores nonletters (which is a bit misleading of me to put it that way, since numbers come before letters and such; it isn’t as stupid as I may have made it sound) at the primary comparison stage, that is tweakable.

    I guess one way to think of the default UCA sort in Perl pseudocode would be:

    primary($a)    <=>   primary($b)  || secondary($a)  <=>   secondary($b)  || tertiary($a)   <=>   tertiary($b)  || quaternary($a) <=>   quaternary($b) 
    Like the code above, the multilevel sort short circuits when it finds a difference between its operands at the level. You can set it to do only N levels if you want. That presumes that those functions are defined to return the right sort of magic number. It doesn’t account for how the default ignorables get skipped unless you manage to fall through to the fourth level. Remember that letter positions aren’t the same as equivalent indices of arrays.

    I bet I’ve managed to convince everyone that nobody needs anything as complicated as this and that you should all go back to garbage ordering. That certainly isn’t my intent. If you just try the collator object’s default sort on text the next time you have to sort some text, without any tweaking or anything, I think you will be aesthetically pleased with the results. More pleased, in fact, than you are(n’t) by the code point garbage order you get with the unmodified builtin sort.

    In other words: “Try it, you’ll like it!”

    DISCLAIMER:
    As always, your mileage may vary — except of course in Europe where this statement is illegal, in which case you didn’t read it anyway so no harm no foul.
Re^3: best sort
by tchrist (Pilgrim) on Aug 16, 2011 at 02:58 UTC
    I thought about it a little more, and decided to revise my answer a bit so I could provide three scenarios where a bare sort might be called for, and also to give a shallower ramp‐up before plunging right into it all.

      I think that the changes are an improvement. But only a slight one.

      Essentially, I think by overstating the case, you are diluting its impact.

      You use emotive terms: "garbage ordering". You make it sound as if you believe that 99% of all uses of sort should use your module and pour scorn on the idea that anyone might not. All this despite that people have been successfully -- enough for virtually all requirements -- been sorting data, including text, for 30 years without it.

      For that, (IMO) very small -- perhaps 1% or 2% -- of cases where people might actually be offended by the relative ordering of accented characters, your module may be valuable. The rest of the time it is barely more than an expensive nice-to-have.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Then ignore the matter of letters with diacritics for now, since you seem either not to use them or else not to care if they change something from being the same letter and it send scurrying off to a completely different segment of your output.

        Instead, just look at a classic “dictionary sort”, where you fold/ignore case and ignore everything but alphanumerics. That is the way phonebooks and card catalogues have historically been ordered in English, since way back before computers even existed. It is useful.

        You can kind of get that running the shell sort -dfu command, but that program is so obsessively oriented toward whitespace separated fields that you need to trick it into using a nonexistent separator, like here using Control-C:

        $ perl -nle 's/^\s*#\s*define\s+// and print' perl/*.h | sort -t^C -df
        (output excerpts inside the <readmore>) See how useful that is? That’s why we’ve done it that way for hundreds of years, because it helps. You certainly don’t need Unicode to demonstrate this principle, as it applies to any text no matter the character repertoire. I should probably have been clearer about this, because it is an important point:
        In these days of bizarre branding using eyecatching typography to get your attention like “CamelCase” and “StUdLyCaPs”, free variation in spacing and hyphenation like “Post-it® Notes” and “postit notes”, and trademarks with non‐ASCII in them like “Häagen‐Dazs®” with its meaningless diacritic or “Encyclopædia Britannica” with its old‐school ligature, it is perhaps more important today than ever before that we have easy access to collation algorithms able to treat “FOOBAR”, “__foobar__”, “Foo Bar”, “foo-bar”, and “ƒöøbɐƦ” as minor variations of the same underlying sequence of six basic letters.
        It is ultra‐useful to be able to think of things that ignore things like diacritics, casing, and non‐alphanumerics. There is nothing new here, since this is has historically been one ordering option frequently used by lexicographers, and it remains completely useful today. With more characters in our repertoires that ever before, it’s much harder to group them the way manual sorters have always done it in the past. But we still want to do so.

        Those are some of the appeals of the Unicode::Collate module for sorting text. I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.

        You use emotive terms.... [you] pour scorn...

        Well if that ain't the pot calling the lily black...