Then ignore the matter of letters with diacritics for now, since you seem either not to use them or else not to care if a diacritic stops something from being the same letter and sends it scurrying off to a completely different segment of your output.

Instead, just look at a classic “dictionary sort”, where you fold/ignore case and ignore everything but alphanumerics. That is the way phonebooks and card catalogues have historically been ordered in English, since way back before computers even existed. It is useful.
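
To make the rule concrete, here is a minimal sketch of a dictionary sort in plain Perl. The names dict_key and dict_sort are made up for this illustration, the key only understands ASCII letters and digits, and breaking ties on the raw string is an assumption of the sketch, not something a phonebook mandates.

```perl
use strict;
use warnings;

# Build the comparison key: drop everything that is not an ASCII
# letter or digit, then fold case. (dict_key is a hypothetical helper.)
sub dict_key {
    my ($s) = @_;
    (my $key = $s) =~ s/[^A-Za-z0-9]+//g;
    return lc $key;
}

# Sort by the folded key; fall back to the raw string on ties so the
# result is deterministic. (The tie-break is an assumption of this sketch.)
sub dict_sort {
    my @items = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ dict_key($_), $_ ] } @items;
}

my @names = qw(av_clear AvARRAY ANDOP add_data ACCEPT accept AvFILL av_fill);
print "$_\n" for dict_sort(@names);
```

With this key, av_fill and AvFILL land next to each other instead of in separate upper- and lowercase blocks, which is the whole point of the exercise.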

You can kind of get a dictionary sort by running the shell’s sort -dfu command, but that program is so obsessively oriented toward whitespace-separated fields that you need to trick it into using a nonexistent separator, like Control-C here:

$ perl -nle 's/^\s*#\s*define\s+// and print' perl/*.h | sort -t^C -df
(output excerpts inside the <readmore>)
accept PerlSock_accept
ACCEPT 96 /* 0x60 Accepts the current matched string. */
access PerlLIO_access
add_alternate(a,b,c) S_add_alternate(aTHX_ a,b,c)
add_cp_to_invlist(a,b) S_add_cp_to_invlist(aTHX_ a,b)
add_data S_add_data
addmad(a,b,c) Perl_addmad(aTHX_ a,b,c)
ADDOP 305
anchored_utf8 substrs->data[0].utf8_substr
ANDAND 323
ANDOP 318
ANONSUB 280
anonymise_cv_maybe(a,b) S_anonymise_cv_maybe(aTHX_ a,b)
any_dup(a,b) Perl_any_dup(aTHX_ a,b)
ANYOF 21 /* 0x15 Match character in (or not in) this class, single char match only */
ANYOF_ALNUM 0 /* \w, PL_utf8_alnum, utf8::IsWord, ALNUM */
ASCIIish
ASCII_MORE_RESTRICT_PAT_MODS "aa"
ASCII_RESTRICT_PAT_MOD 'a'
ASCII_RESTRICT_PAT_MODS "a"
ASCII_TO_NATIVE(ch) (ch)
ASCII_TO_NATIVE(ch) PL_a2e[(U8)(ch)]
ASCII_TO_NEED(enc,ch) (ch)
ASCII_TO_NEED(enc,ch) ((enc) ? UTF_TO_NATIVE(ch) : ASCII_TO_NATIVE(ch))
asctime(a) asctime_r(a, PL_reentrant_buffer->_asctime_buffer)
asctime(a) (asctime_r(a, PL_reentrant_buffer->_asctime_buffer) == 0 ? PL_reentrant_buffer->_asctime_buffer : 0)
asctime(a) asctime_r(a, PL_reentrant_buffer->_asctime_buffer, PL_reentrant_buffer->_asctime_size)
asctime(a) (asctime_r(a, PL_reentrant_buffer->_asctime_buffer, PL_reentrant_buffer->_asctime_size) == 0 ? PL_reentrant_buffer->_asctime_buffer : 0)
ASCTIME_R_PROTO 0 /**/
AvALLOC(av) ((XPVAV*) SvANY(av))->xav_alloc
AvARRAY(av) ((av)->sv_u.svu_array)
AvARYLEN(av) (*Perl_av_arylen_p(aTHX_ MUTABLE_AV(av)))
av_clear(a) Perl_av_clear(aTHX_ a)
av_delete(a,b,c) Perl_av_delete(aTHX_ a,b,c)
av_exists(a,b) Perl_av_exists(aTHX_ a,b)
av_extend(a,b) Perl_av_extend(aTHX_ a,b)
av_fetch(a,b,c) Perl_av_fetch(aTHX_ a,b,c)
av_fill(a,b) Perl_av_fill(aTHX_ a,b)
AvFILL(av) ((SvRMAGICAL((const SV *) (av))) \
AvFILLp(av) ((XPVAV*) SvANY(av))->xav_fill
av_len(a) Perl_av_len(aTHX_ a)
av_make(a,b) Perl_av_make(aTHX_ a,b)
AvMAX(av) ((XPVAV*) SvANY(av))->xav_max
av_pop(a) Perl_av_pop(aTHX_ a)
av_push(a,b) Perl_av_push(aTHX_ a,b)
AvREAL(av) (SvFLAGS(av) & SVpav_REAL)
AvREALISH(av) (SvFLAGS(av) & (SVpav_REAL|SVpav_REIFY))
AvREAL_off(av) (SvFLAGS(av) &= ~SVpav_REAL)
AvREAL_on(av) (SvFLAGS(av) |= SVpav_REAL)
AvREAL_only(av) (AvREIFY_off(av), SvFLAGS(av) |= SVpav_REAL)

See how useful that is? That’s why we’ve done it that way for hundreds of years, because it helps. You certainly don’t need Unicode to demonstrate this principle, as it applies to any text no matter the character repertoire. I should probably have been clearer about this, because it is an important point:
In these days of bizarre branding using eyecatching typography to get your attention like “CamelCase” and “StUdLyCaPs”, free variation in spacing and hyphenation like “Post-it® Notes” and “postit notes”, and trademarks with non‐ASCII in them like “Häagen‐Dazs®” with its meaningless diacritic or “Encyclopædia Britannica” with its old‐school ligature, it is perhaps more important today than ever before that we have easy access to collation algorithms able to treat “FOOBAR”, “__foobar__”, “Foo Bar”, “foo-bar”, and “ƒöøbɐƦ” as minor variations of the same underlying sequence of six basic letters.
It is ultra‐useful to be able to think of strings in a way that ignores things like diacritics, casing, and non‐alphanumerics. There is nothing new here, since this has historically been one ordering option frequently used by lexicographers, and it remains completely useful today. With more characters in our repertoires than ever before, it’s much harder to group them the way manual sorters have always done in the past. But we still want to do so.

Those are some of the appeals of the Unicode::Collate module for sorting text. I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.
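
As a rough sketch of what that looks like in practice, assuming the default collation table that ships with Unicode::Collate: a collator built with level => 1 compares base letters only, so case and diacritics drop out, and under the module's default "shifted" handling of variable characters the spaces, hyphens, and underscores drop out too. The foobar strings are borrowed from the text above; the ASCII-hyphen spelling of the ice cream brand and the deliberately different endings in the sort call are invented for the demonstration.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");

# level => 1: primary strength, i.e. base letters only.
# Variable characters (spaces, punctuation) are "shifted" by default,
# which makes them invisible at this strength.
my $collator = Unicode::Collate->new(level => 1);

# Each variant should compare equal to plain "foobar".
for my $v ("FOOBAR", "__foobar__", "Foo Bar", "foo-bar") {
    printf "%-12s %s\n", $v,
        $collator->eq("foobar", $v) ? "same letters" : "different";
}

# A diacritic is ignored the same way (\x{E4} is a-with-diaeresis).
print "brand names match\n"
    if $collator->eq("H\x{E4}agen-Dazs", "haagen dazs");

# Sorting looks only at the letters, so the final letter decides here.
my @sorted = $collator->sort("foo-bat", "Foo Bar", "FOOBAA", "__foobaz__");
print join(", ", @sorted), "\n";
```

Raising the level brings accents back at 2 and case back at 3, so the same module also covers the stricter orderings when you want them.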


In reply to Re^5: best sort by tchrist
in thread best sort by ag4ve
