Calling sort without a comparison function is quite often the wrong thing to do, even on plain text.

There are a couple of things that sorting can do for you, such as telling you which direction to keep looking and keeping equal items together, but they won't necessarily happen if you supply your own comparison function.

Those of us accustomed to dealing solely with 7-bit English characters take advantage of these properties, which fall out nicely from the default comparison function. For example, if I am looking for the word "macnulty" and I find "machinery" in the sorted output, I know I must look later in the list, and I know that "mable" cannot appear after either of them. I further know that all occurrences of "machinery" appear together, so, having seen "machinery" followed by something else, I will never again see "machinery" in the list. This is true of all prefixes of strings as well as complete strings.
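A minimal sketch of those two properties, assuming a small made-up word list. The helper name found_in_sorted is hypothetical; it only illustrates that, with the default comparison, a scan can give up as soon as it passes the spot where the target would have been.

```perl
use strict;
use warnings;

# Made-up sample data; sort with no comparison function compares with cmp.
my @sorted = sort qw(machinery mable macnulty machinery mabel);

# Scan a default-sorted list for $target, giving up as soon as we
# reach an entry that sorts after it (hypothetical helper).
sub found_in_sorted {
    my ($target, @list) = @_;
    for my $item (@list) {
        return 1 if $item eq $target;
        return 0 if $item gt $target;   # everything later sorts later still
    }
    return 0;
}

print join(' ', @sorted), "\n";
print found_in_sorted('macnulty', @sorted) ? "found\n" : "absent\n";
```

Both occurrences of "machinery" come out adjacent, and a search for a word that is not in the list stops at the first entry that sorts after it.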

When one ventures outside of 7-bit code points, as a good citizen of the world must, or into domain-specific applications, like surnames, the "proper" sort order may not preserve these properties. "machinery" may follow both "macnulty" and "mable". I might encounter "mcnulty" between two distinct occurrences of "macnulty" unless someone has been careful to tie-break nominally (an appropriate term in this case) identical surnames using something very like the default comparison function.
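As a rough sketch of that tie-break, assume a simplified surname rule that treats a leading "mc" as "mac". The rule and the names surname_key and by_surname are illustrative assumptions, not anyone's actual white pages rules. The comparison falls back to the default cmp whenever the domain-specific keys are equal.

```perl
use strict;
use warnings;

# Hypothetical normalization: treat a leading "mc" as "mac".
sub surname_key {
    my ($name) = @_;
    (my $key = lc $name) =~ s/^mc/mac/;
    return $key;
}

# Compare by the domain-specific key first, then fall back to the
# default string comparison so identical spellings stay together.
sub by_surname {
    return surname_key($a) cmp surname_key($b)
        || $a cmp $b;
}

my @names  = qw(macnulty mcnulty macnulty machinery);
my @sorted = sort by_surname @names;
print join(' ', @sorted), "\n";
```

Without the `|| $a cmp $b` fallback, the three names with the same key compare as equal, so nothing in the comparison function itself keeps the two "macnulty" entries next to each other.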

The comparison function is a contract, of sorts, between the producer of and the consumers of the sorted output. If the producer and consumers have different expectations, something unseemly is likely to occur. One of my first assignments was to sort white pages listings using rules that were so complex (where does "General John Smith Jr" sort relative to "Doctor John J Smith III"?) that I could be certain that the average white pages user had no clue what the rules actually were. Fortunately, most lists of names were short enough that users could spot the right name without understanding the sort order. Sorting has its place, but indexing on N-grams is proving to be a more user-friendly search mechanism.
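For readers who have not met N-gram indexing, here is a very rough sketch using trigrams over a few made-up listings. The helper names trigrams and search are hypothetical, and this is only one simple way to build such an index, not a description of any particular directory system.

```perl
use strict;
use warnings;

# Made-up listings for illustration only.
my @listings = ('General John Smith Jr', 'Doctor John J Smith III', 'Mary Jones');

# Split a string into overlapping lowercase trigrams.
sub trigrams {
    my ($text) = @_;
    my $s = lc $text;
    return map { substr($s, $_, 3) } 0 .. length($s) - 3;
}

# Index: trigram => { listing number => 1 }.
my %index;
for my $i (0 .. $#listings) {
    $index{$_}{$i} = 1 for trigrams($listings[$i]);
}

# Return the listings that contain every trigram of the query.
# (A query shorter than three characters has no trigrams and matches all.)
sub search {
    my ($query) = @_;
    my @grams = trigrams($query);
    my @hits;
    for my $i (0 .. $#listings) {
        my $missing = grep { !($index{$_} && $index{$_}{$i}) } @grams;
        push @hits, $listings[$i] if $missing == 0;
    }
    return @hits;
}

print "$_\n" for search('smith');
```

The lookup finds both Smith listings wherever they happen to fall in the sorted output, so the user does not need to know the ordering rules.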


In reply to Re^2: best sort by jpl
in thread best sort by ag4ve
