in reply to Re^8: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?
If the overall data is (close to) what you describe, my first inclination would be to partition or segregate the data, by checking for the following conditions in the order shown:
Obviously, you have to start by using plain old binmode to read the input as raw bytes. In case you didn't look it up yet, the test for step 3 is:
    eval { decode("utf8",$input,Encode::FB_CROAK) };
If the eval succeeds, it's utf8 data.
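To make that test easy to drop into a script, here is a minimal sketch of it wrapped in a sub. The name is_valid_utf8 is made up for illustration, not something from Encode. Two details are worth noting: decode() with a CHECK value such as FB_CROAK may modify its source argument in place, so the sketch works on a copy; and the eval block ends in "1" so that an empty string (which decodes to a false value) still counts as success.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (name is an assumption): returns 1 if $bytes is
# well-formed utf8 according to Encode, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;

    # Work on a copy: with a CHECK argument, decode() may alter its source.
    my $copy = $bytes;

    # The trailing "1" makes the eval true even when the decoded string
    # itself is false (empty string or "0").
    my $ok = eval { decode("utf8", $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

# Usage on raw byte strings, as they would arrive from a binmode'd handle.
for my $sample ("plain ascii", "caf\xC3\xA9 au lait", "caf\xE9 au lait") {
    printf "%-6s %s\n", (is_valid_utf8($sample) ? "utf8" : "other"), length $sample;
}
```

Pure ASCII input also passes this test, which is why the ASCII check has to come earlier in the sequence if you want that group kept apart.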
Default sorting within some of those partitions would make sense. For the others, it's not so much a matter of making sense, but rather just behaving in some consistent, predictable way.
Note that group 2 could actually qualify as a subset of groups 3-5 - and that's a good reason to keep it distinct from those others.
Apart from that, if there's some desire to "classify" or "cluster" the non-ASCII, non-Unicode strings, statistics on byte n-grams can help a fair bit with that (but it remains something of a research task, with some training of models required for classification).
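As a rough illustration of what "statistics on byte n-grams" starts from, here is a minimal sketch that just counts overlapping byte n-grams in a raw string. The sub name byte_ngram_counts is made up for this example, and this is only the counting step; any actual classification or clustering on top of these counts is a separate problem and is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): count every overlapping
# n-byte substring of $bytes. Returns a hash ref of ngram => count.
sub byte_ngram_counts {
    my ($bytes, $n) = @_;
    my %count;

    # If the string is shorter than $n, the range is empty and nothing
    # is counted.
    for my $i (0 .. length($bytes) - $n) {
        $count{ substr($bytes, $i, $n) }++;
    }
    return \%count;
}

# Usage: byte bigrams of a Latin-1 encoded sample, shown as hex.
my $counts = byte_ngram_counts("caf\xE9 caf\xE9", 2);
for my $gram (sort keys %$counts) {
    printf "%s  %d\n", unpack("H*", $gram), $counts->{$gram};
}
```

The input should be the raw bytes (not a decoded string), so that substr and length operate on bytes rather than characters.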
(updated to amend the conditions for set 5)