Then ignore the matter of letters with diacritics for now, since you seem either not to use them or else not to care whether a diacritic turns a letter into a different one and sends it scurrying off to a completely different segment of your output.

Instead, just look at a classic “dictionary sort”, where you fold/ignore case and ignore everything but alphanumerics. That is the way phonebooks and card catalogues have historically been ordered in English, since way back before computers even existed. It is useful.

You can kind of get that by running the shell’s sort command with -df (plus -u if you also want duplicates dropped), but that program is so obsessively oriented toward whitespace-separated fields that you need to trick it into using a separator that never occurs in the input, like Control-C here:

$ perl -nle 's/^\s*#\s*define\s+// and print' perl/*.h | sort -t^C -df

(output excerpts inside the <readmore>)

accept          PerlSock_accept
ACCEPT                  96      /* 0x60 Accepts the current matched string. */
access          PerlLIO_access
add_alternate(a,b,c)    S_add_alternate(aTHX_ a,b,c)
add_cp_to_invlist(a,b)  S_add_cp_to_invlist(aTHX_ a,b)
add_data                S_add_data
addmad(a,b,c)           Perl_addmad(aTHX_ a,b,c)
ADDOP 305

anchored_utf8 substrs->data[0].utf8_substr
ANDAND 323
ANDOP 318
ANONSUB 280
anonymise_cv_maybe(a,b) S_anonymise_cv_maybe(aTHX_ a,b)
any_dup(a,b)            Perl_any_dup(aTHX_ a,b)
ANYOF                   21      /* 0x15 Match character in (or not in) this class, single char match only */
ANYOF_ALNUM      0      /* \w, PL_utf8_alnum, utf8::IsWord, ALNUM */

ASCIIish
ASCII_MORE_RESTRICT_PAT_MODS "aa"
ASCII_RESTRICT_PAT_MOD 'a'
ASCII_RESTRICT_PAT_MODS "a"
ASCII_TO_NATIVE(ch)      (ch)
ASCII_TO_NATIVE(ch)      PL_a2e[(U8)(ch)]
ASCII_TO_NEED(enc,ch)    (ch)
ASCII_TO_NEED(enc,ch)    ((enc) ? UTF_TO_NATIVE(ch) : ASCII_TO_NATIVE(ch))
asctime(a) asctime_r(a, PL_reentrant_buffer->_asctime_buffer)
asctime(a) (asctime_r(a, PL_reentrant_buffer->_asctime_buffer) == 0 ? PL_reentrant_buffer->_asctime_buffer : 0)
asctime(a) asctime_r(a, PL_reentrant_buffer->_asctime_buffer, PL_reentrant_buffer->_asctime_size)
asctime(a) (asctime_r(a, PL_reentrant_buffer->_asctime_buffer, PL_reentrant_buffer->_asctime_size) == 0 ? PL_reentrant_buffer->_asctime_buffer : 0)
ASCTIME_R_PROTO 0          /**/

AvALLOC(av)     ((XPVAV*)  SvANY(av))->xav_alloc
AvARRAY(av)     ((av)->sv_u.svu_array)
AvARYLEN(av)    (*Perl_av_arylen_p(aTHX_ MUTABLE_AV(av)))
av_clear(a)             Perl_av_clear(aTHX_ a)
av_delete(a,b,c)        Perl_av_delete(aTHX_ a,b,c)
av_exists(a,b)          Perl_av_exists(aTHX_ a,b)
av_extend(a,b)          Perl_av_extend(aTHX_ a,b)
av_fetch(a,b,c)         Perl_av_fetch(aTHX_ a,b,c)
av_fill(a,b)            Perl_av_fill(aTHX_ a,b)
AvFILL(av)      ((SvRMAGICAL((const SV *) (av))) \
AvFILLp(av)     ((XPVAV*)  SvANY(av))->xav_fill
av_len(a)               Perl_av_len(aTHX_ a)
av_make(a,b)            Perl_av_make(aTHX_ a,b)
AvMAX(av)       ((XPVAV*)  SvANY(av))->xav_max
av_pop(a)               Perl_av_pop(aTHX_ a)
av_push(a,b)            Perl_av_push(aTHX_ a,b)
AvREAL(av)      (SvFLAGS(av) & SVpav_REAL)
AvREALISH(av)   (SvFLAGS(av) & (SVpav_REAL|SVpav_REIFY))
AvREAL_off(av)  (SvFLAGS(av) &= ~SVpav_REAL)
AvREAL_on(av)   (SvFLAGS(av) |= SVpav_REAL)
AvREAL_only(av) (AvREIFY_off(av), SvFLAGS(av) |= SVpav_REAL)
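If you would rather not depend on sort(1) at all, the same idea can be sketched in plain Perl. This is only an illustration, not what the pipeline above does internally: dict_key and dict_sort are names I made up for the example, and unlike sort -d (which still considers blanks) this version throws away everything except ASCII letters and digits before comparing.

```perl
use strict;
use warnings;

# Hypothetical helper: build a dictionary-order key by folding case
# and deleting everything that is not an ASCII letter or digit.
sub dict_key {
    my ($s) = @_;
    my $key = lc $s;
    $key =~ tr/a-z0-9//cd;    # delete the complement of a-z0-9
    return $key;
}

# Hypothetical helper: sort by the folded key, falling back to the raw
# string so that entries with identical keys come out in a fixed order.
sub dict_sort {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ dict_key($_), $_ ] } @_;
}

my @names = qw(av_push AvMAX av_len AvFILL ANYOF add_data);
print "$_\n" for dict_sort(@names);
```

With that sample list the names come out as add_data, ANYOF, AvFILL, av_len, AvMAX, av_push, which interleaves the upper- and lowercase identifiers the same way the excerpts above do.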

See how useful that is? That’s why we’ve done it that way for hundreds of years, because it helps. You certainly don’t need Unicode to demonstrate this principle, as it applies to any text no matter the character repertoire. I should probably have been clearer about this, because it is an important point:

In these days of bizarre branding that uses eye-catching typography like “CamelCase” and “StUdLyCaPs” to get your attention, free variation in spacing and hyphenation like “Post-it® Notes” and “postit notes”, and trademarks with non‐ASCII in them like “Häagen‐Dazs®” with its meaningless diacritic or “Encyclopædia Britannica” with its old‐school ligature, it is perhaps more important today than ever before that we have easy access to collation algorithms able to treat “FOOBAR”, “__foobar__”, “Foo Bar”, “foo-bar”, and “ƒöøbɐƦ” as minor variations of the same underlying sequence of six basic letters.

It is ultra‐useful to be able to compare strings in a way that ignores diacritics, casing, and non‐alphanumerics. There is nothing new here, since this has historically been one ordering option frequently used by lexicographers, and it remains completely useful today. With more characters in our repertoires than ever before, it’s much harder to group them the way manual sorters have always done in the past. But we still want to do so.

Those are some of the appeals of the Unicode::Collate module for sorting text. I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.
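To make that concrete, here is a minimal sketch using Unicode::Collate. It assumes the default collation table that ships with the module, compares at level 1 (primary strength, so case and accents are not considered), and uses variable => 'shifted' so that spaces and punctuation are ignored at that level. The sample strings are mine, chosen conservatively; whether more exotic look-alike letters also count as the same base letter depends on the collation table, so I make no claim about those here.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Primary-strength comparison: ignore case and diacritics, and let
# "variable" characters (spaces, punctuation) drop out entirely.
my $collator = Unicode::Collate->new(
    level    => 1,
    variable => 'shifted',
);

# Each of these is checked against the plain six-letter spelling.
my @variants = ("FOOBAR", "__foobar__", "Foo Bar", "foo-bar", "föóbar");
for my $v (@variants) {
    my $verdict = $collator->eq($v, "foobar") ? "same" : "different";
    print "$v: $verdict\n";
}

# The same collator can sort a list in dictionary order.
my @sorted = $collator->sort(qw(av_push AvMAX av_len));
print "@sorted\n";
```

Because everything is compared at level 1, strings that differ only in case, accents, or punctuation tie with one another. If you need a fixed order among the tied entries, raise the level or add your own tie-breaker.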

In reply to Re^5: best sort by tchrist
in thread best sort by ag4ve
