But what would be the cure?
The least useful property of Unicode is that a trivial subset of it can appear to be 'simple text'.
Every other binary format in common use self-identifies through the use of 'signatures', e.g. "GIF87a" and "GIF89a".
Some parts of Unicode have several names, some of which are deprecated. Other associated terms have meant, and in some cases still do mean, two or more different things.
It creates far more problems than it fixes, and is the archetypal 'premature optimisation': one that has long since outlived its benefit or purpose.
Just imagine how much simpler, safer, and more efficient it would be if you could read the first few bytes of a file and *know* what it contains.
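Something along the lines of this rough sketch, say (the signature table is purely illustrative and deliberately tiny):

    use strict;
    use warnings;

    # A few well-known signatures ('magic numbers'); illustrative, not exhaustive.
    my %sigs = (
        "GIF87a"            => 'GIF (87a)',
        "GIF89a"            => 'GIF (89a)',
        "\x89PNG\r\n\x1a\n" => 'PNG',
        "%PDF-"             => 'PDF',
    );

    sub identify {
        my( $path ) = @_;
        open my $fh, '<:raw', $path or die "open '$path': $!";
        read( $fh, my $head, 8 );    # the first few bytes are all you need
        for my $sig ( keys %sigs ) {
            return $sigs{ $sig } if index( $head, $sig ) == 0;
        }
        return 'no signature -- start guessing';
    }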
Imagine how much more efficient it would be if, to read the 10 characters starting at the 1073741823rd character of a file, you could simply do (say):
seek FH, 1073741823 * 3 + SIG_SIZE, 0;  read( FH, $in, 10 * 3 );   # fixed 3 bytes per character, after a SIG_SIZE-byte signature
Instead of having to a) guess the encoding; and b) read all the bytes from the beginning, counting characters as you go.
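Spelled out a little more (again, just a sketch: FIXED_WIDTH, SIG_SIZE and the whole file layout are hypothetical, because no such self-describing fixed-width format exists today):

    use strict;
    use warnings;

    use constant {
        SIG_SIZE    => 8,   # hypothetical: length of the format signature
        FIXED_WIDTH => 3,   # hypothetical: fixed bytes per character
    };

    # The imagined format: one seek, one read, regardless of file size.
    # Assumes $fh was opened '<:raw'.
    sub read_chars_fixed {
        my( $fh, $start, $count ) = @_;
        seek $fh, SIG_SIZE + $start * FIXED_WIDTH, 0 or die "seek: $!";
        read( $fh, my $buf, $count * FIXED_WIDTH );
        return $buf;
    }

    # UTF-8 today: decode every byte from the start, counting characters as you go.
    # (In practice you would skip in chunks, but you still cannot avoid the scan.)
    sub read_chars_utf8 {
        my( $fh, $start, $count ) = @_;
        binmode $fh, ':encoding(UTF-8)';
        seek $fh, 0, 0;
        read( $fh, my $skip, $start );   # read and throw away $start characters
        read( $fh, my $buf,  $count );
        return $buf;
    }

The first version costs the same whether the character you want is the 10th or the 1073741823rd; the second has to chew through a gigabyte or more of bytes just to find its starting point.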
Imagine all the other examples of stupid guesswork and inefficiency that I could have used.
Imagine not having to deal with any of them.
Imagine that programmers said: "Enough is enough. Give us a simple, single, sane, self-describing format for encoding the world's data."