in reply to Re^4: best sort
in thread best sort
Instead, just look at a classic “dictionary sort”, where you fold/ignore case and ignore everything but alphanumerics. That is the way phonebooks and card catalogues have historically been ordered in English, since way back before computers even existed. It is useful.
You can kind of get that by running the shell’s sort -df command (or -dfu to also drop duplicates), but that program is so obsessively oriented toward whitespace‐separated fields that you need to trick it into using a nonexistent separator, like here using Control‐C:
(Output excerpted.)

    $ perl -nle 's/^\s*#\s*define\s+// and print' perl/*.h | sort -t^C -df
    accept PerlSock_accept
    ACCEPT 96 /* 0x60 Accepts the current matched string. */
    access PerlLIO_access
    add_alternate(a,b,c) S_add_alternate(aTHX_ a,b,c)
    add_cp_to_invlist(a,b) S_add_cp_to_invlist(aTHX_ a,b)
    add_data S_add_data
    addmad(a,b,c) Perl_addmad(aTHX_ a,b,c)
    ADDOP 305
    anchored_utf8 substrs->data[0].utf8_substr
    ANDAND 323
    ANDOP 318
    ANONSUB 280
    anonymise_cv_maybe(a,b) S_anonymise_cv_maybe(aTHX_ a,b)
    any_dup(a,b) Perl_any_dup(aTHX_ a,b)
    ANYOF 21 /* 0x15 Match character in (or not in) this class, single char match only */
    ANYOF_ALNUM 0 /* \w, PL_utf8_alnum, utf8::IsWord, ALNUM */
    ASCIIish
    ASCII_MORE_RESTRICT_PAT_MODS "aa"
    ASCII_RESTRICT_PAT_MOD 'a'
    ASCII_RESTRICT_PAT_MODS "a"
    ASCII_TO_NATIVE(ch) (ch)
    ASCII_TO_NATIVE(ch) PL_a2e[(U8)(ch)]
    ASCII_TO_NEED(enc,ch) (ch)
    ASCII_TO_NEED(enc,ch) ((enc) ? UTF_TO_NATIVE(ch) : ASCII_TO_NATIVE(ch))
    asctime(a) asctime_r(a, PL_reentrant_buffer->_asctime_buffer)
    asctime(a) (asctime_r(a, PL_reentrant_buffer->_asctime_buffer) == 0 ? PL_reentrant_buffer->_asctime_buffer : 0)
    asctime(a) asctime_r(a, PL_reentrant_buffer->_asctime_buffer, PL_reentrant_buffer->_asctime_size)
    asctime(a) (asctime_r(a, PL_reentrant_buffer->_asctime_buffer, PL_reentrant_buffer->_asctime_size) == 0 ? PL_reentrant_buffer->_asctime_buffer : 0)
    ASCTIME_R_PROTO 0 /**/
    AvALLOC(av) ((XPVAV*) SvANY(av))->xav_alloc
    AvARRAY(av) ((av)->sv_u.svu_array)
    AvARYLEN(av) (*Perl_av_arylen_p(aTHX_ MUTABLE_AV(av)))
    av_clear(a) Perl_av_clear(aTHX_ a)
    av_delete(a,b,c) Perl_av_delete(aTHX_ a,b,c)
    av_exists(a,b) Perl_av_exists(aTHX_ a,b)
    av_extend(a,b) Perl_av_extend(aTHX_ a,b)
    av_fetch(a,b,c) Perl_av_fetch(aTHX_ a,b,c)
    av_fill(a,b) Perl_av_fill(aTHX_ a,b)
    AvFILL(av) ((SvRMAGICAL((const SV *) (av))) \
    AvFILLp(av) ((XPVAV*) SvANY(av))->xav_fill
    av_len(a) Perl_av_len(aTHX_ a)
    av_make(a,b) Perl_av_make(aTHX_ a,b)
    AvMAX(av) ((XPVAV*) SvANY(av))->xav_max
    av_pop(a) Perl_av_pop(aTHX_ a)
    av_push(a,b) Perl_av_push(aTHX_ a,b)
    AvREAL(av) (SvFLAGS(av) & SVpav_REAL)
    AvREALISH(av) (SvFLAGS(av) & (SVpav_REAL|SVpav_REIFY))
    AvREAL_off(av) (SvFLAGS(av) &= ~SVpav_REAL)
    AvREAL_on(av) (SvFLAGS(av) |= SVpav_REAL)
    AvREAL_only(av) (AvREIFY_off(av), SvFLAGS(av) |= SVpav_REAL)
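For comparison, the same fold‐case, alphanumerics‐only ordering can be sketched directly in Perl, with no separator trick needed. This is only a rough illustration under stated assumptions: dict_key and dict_sort are hypothetical helper names, the key keeps ASCII letters and digits only, and unlike sort -d it also drops blanks.

```perl
use strict;
use warnings;

# Hypothetical helper: build a comparison key by dropping everything
# except ASCII letters and digits, then folding to lowercase.
sub dict_key {
    my ($s) = @_;
    (my $key = $s) =~ s/[^A-Za-z0-9]+//g;
    return lc $key;
}

# Hypothetical helper: sort by the dictionary key, falling back to plain
# string comparison so that ties come out in a predictable order.
sub dict_sort {
    my @keyed = map { [ dict_key($_), $_ ] } @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } @keyed;
}

# A few macro names taken from the listing above, deliberately shuffled.
my @names = ('av_push(a,b)', 'AvREAL(av)', 'av_pop(a)', 'AvMAX(av)', 'av_len(a)');
print "$_\n" for dict_sort(@names);
```

On those five names this should reproduce the relative order seen in the sort -df excerpt, with the underscore and the mixed case playing no part in the comparison.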
In these days of bizarre branding using eyecatching typography to get your attention like “CamelCase” and “StUdLyCaPs”, free variation in spacing and hyphenation like “Post-it® Notes” and “postit notes”, and trademarks with non‐ASCII in them like “Häagen‐Dazs®” with its meaningless diacritic or “Encyclopædia Britannica” with its old‐school ligature, it is perhaps more important today than ever before that we have easy access to collation algorithms able to treat “FOOBAR”, “__foobar__”, “Foo Bar”, “foo-bar”, and “ƒöøbɐƦ” as minor variations of the same underlying sequence of six basic letters.

It is ultra‐useful to be able to sort in a way that ignores things like diacritics, casing, and non‐alphanumerics. There is nothing new here: this has historically been one ordering option frequently used by lexicographers, and it remains completely useful today. With more characters in our repertoires than ever before, it’s much harder to group them the way manual sorters have always done it in the past. But we still want to do so.
Those are some of the appeals of the Unicode::Collate module for sorting text. I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.
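A minimal sketch of what that can look like with Unicode::Collate follows. It assumes the module's default collation table, and the particular options chosen (level => 1 together with variable => 'shifted') are one plausible configuration rather than the only one. The more exotic “ƒöøbɐƦ” is left out, since how those letters compare depends on the collation data in use.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# level => 1 compares primary weights only, so case and diacritic
# differences are ignored. variable => 'shifted' makes spaces and
# punctuation ignorable at that level as well.
my $collator = Unicode::Collate->new(level => 1, variable => 'shifted');

# Each of these should compare equal to plain "foobar" under that collator.
for my $variant ('FOOBAR', '__foobar__', 'Foo Bar', 'foo-bar') {
    printf "%-12s %s\n", $variant,
        $collator->eq($variant, 'foobar') ? 'same' : 'different';
}

# The diacritic and the hyphen in the trademark drop out the same way.
print "Häagen-Dazs matches haagendazs\n"
    if $collator->eq('Häagen-Dazs', 'haagendazs');

# Sorting uses the same rules, so only the letters decide the order.
print join(', ', $collator->sort('foo-bar', 'Foo Baz', 'FOOBAA')), "\n";
```

Raising the level to 2 or 3 would bring diacritics and then case back into the comparison, which is a convenient way to break ties once the basic letters have been grouped together.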