Given that description, any sense of "sorting" seems pretty meaningless. Is there some other term that might better describe a sequencing of elements that is better than random?

If the overall data is (close to) what you describe, my first inclination would be to partition or segregate the data, by checking for the following conditions in the order shown:

1. chunks that contain null bytes (these are probably UTF16 or UCS2)
2. chunks that are entirely comprised of 7-bit ASCII
3. chunks with some non-ASCII that are properly utf8 encoded
4. chunks with some non-ASCII that are not proper utf8
5. chunks that are not utf8 but are ~~entirely~~ mostly comprised of bytes in the range 128-255, except for carriage-returns and line-feeds and maybe tabs (some pre-Unicode Asian encodings could behave this way, even though all such encodings could also accommodate ASCII bytes interspersed with non-ASCII byte pairs that make up 16-bit characters).

Obviously, you have to start by using plain old binmode to read the input as raw bytes. In case you didn't look it up yet, the test for step 3 is:

use Encode qw(decode);
my $is_utf8 = eval { decode("UTF-8", $input, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };  # LEAVE_SRC keeps $input intact

If the eval succeeds, it's utf8 data.
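
Putting the five checks together might look roughly like the sketch below. The function name classify_chunk and the 0.9 cutoff for "mostly" high bytes are my own assumptions, not anything standard. Note also that, taken strictly in order, condition 4 would swallow everything that condition 5 describes, so this sketch tests for 5 first among the chunks that fail the utf8 check. That is one possible reading of the list, not the only one.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the group number (1-5) for a chunk of raw bytes.
sub classify_chunk {
    my ($bytes) = @_;

    # 1. Null bytes suggest UTF-16 or UCS-2.
    return 1 if $bytes =~ /\x00/;

    # 2. Nothing above 0x7F means pure 7-bit ASCII.
    return 2 if $bytes !~ /[\x80-\xFF]/;

    # 3. Non-ASCII is present and the whole chunk decodes as strict UTF-8.
    #    LEAVE_SRC stops decode() from modifying $bytes in place.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return 3 if $ok;

    # 5. Not UTF-8, and mostly high bytes once CR, LF and tab are ignored.
    (my $body = $bytes) =~ tr/\r\n\t//d;
    my $high = ($body =~ tr/\x80-\xFF//);
    return 5 if length($body) && $high / length($body) >= 0.9;  # cutoff is an assumption

    # 4. Everything else that is not valid UTF-8.
    return 4;
}
```

With this, a Latin-1 string such as "caf\xE9 au lait" lands in group 4, while a line made up only of high-byte pairs plus a CRLF lands in group 5. The cutoff would need tuning against the real data.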

Default sorting within some of those partitions would make sense. For the others, it's not so much a matter of making sense, but rather just behaving in some consistent, predictable way.

Note that group 2 could actually qualify as a subset of groups 3-5 - and that's a good reason to keep it distinct from those others.

Apart from that, if there's some desire to "classify" or "cluster" the non-ASCII, non-Unicode strings, statistics on byte ngrams can help a fair bit with that (but it remains a bit of a research task, with some training of models required for classification).
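
As a starting point for those statistics, here is a minimal sketch of counting overlapping byte ngrams in a raw string. The name byte_ngrams is hypothetical, and this only gathers the counts; comparing the resulting profiles against trained models for known encodings is the research part and is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: count overlapping byte n-grams in a raw byte string.
# Returns a hash ref mapping each n-byte substring to its frequency.
sub byte_ngrams {
    my ($bytes, $n) = @_;
    my %count;
    # The range is empty when the string is shorter than $n.
    for my $i (0 .. length($bytes) - $n) {
        $count{ substr($bytes, $i, $n) }++;
    }
    return \%count;
}
```

Calling byte_ngrams($chunk, 2) on each chunk in groups 4 and 5 gives bigram tables; printing the keys with unpack("H*", $key) makes them readable.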

(updated to amend the conditions for set 5)

In reply to Re^9: Mixed Unicode and ANSI string comparisons? by graff
in thread Mixed Unicode and ANSI string comparisons? by BrowserUk
