in reply to Re^8: best sort
in thread best sort

You seem to have managed to confuse Unicode with serialization formats. That’s a shame.

As for knowing what sort of data content you have, that has never been Unicode’s job. That is something one must relegate to a higher-level protocol. It’s just like with receiving a file over the web. If you expect to know what to do with the file, then you need various bits of metadata to know how to handle it. If someone sends you a file but doesn’t tell you what’s in it, that’s a personal problem. It’s not a Unicode problem at all. You have a social problem, which is something else altogether. You need a better higher-level protocol is all.

That said, because Unicode was both exceedingly careful and also reasonably clever about how it defined its approved variable‐width serialization schemes, I have no trouble in the world at all knowing which of the three I have:

    $ perl -CS -S unichars Singleton > sample-one
    $ iconv -f UTF-8 -t UTF-16 < sample-one > sample-two
    $ iconv -f UTF-8 -t UTF-32 < sample-one > sample-three
    $ file sample-{one,two,three}
    sample-one: UTF-8 Unicode text
    sample-two: Little-endian UTF-16 Unicode text
    sample-three: Unicode text, UTF-32, little-endian
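For anyone who wants to see roughly how such a guess can be made, here is a minimal sketch that looks only at a leading byte order mark. It is not a description of how file(1) actually works; the helper name sniff_bom is made up for illustration, and data without a BOM (which is common for UTF-8) simply comes back as unknown.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the serialization scheme from the leading bytes.
# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 ones, because
# FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'unknown (no BOM)';
}

# U+00E9 serialized three ways, each preceded by its BOM.
print sniff_bom("\xEF\xBB\xBF\xC3\xA9"), "\n";
print sniff_bom("\xFF\xFE\xE9\x00"), "\n";
print sniff_bom("\xFF\xFE\x00\x00\xE9\x00\x00\x00"), "\n";
```

One known weakness of this sketch: UTF-16LE text whose first character is U+0000 would be reported as UTF-32LE, so treat the result as a hint and not a verdict.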

There aren’t many different flavors of Unicode, as you frequently allege. There can be only one. That’s what the “uni” part is about. That’s why things like Perl and XML and HTML are always all Unicode, all the time: because it always means the same thing. It makes no matter whether you say chr(233) in Perl, &#233; in HTML, or &#xe9; in XML. Those are always the same character, because the Unicode mapping of assigned code points to characters is always the same and guaranteed never to change. And that character is always LATIN SMALL LETTER E WITH ACUTE. Similarly, something like HTML’s &eacute; always maps to Unicode code point 233. It’s not like the same character is code point 142 on a Mac and code point 221 on NextStep. That would be wrong. That’s why modern systems like Perl and HTML and XML are 100% Unicode: so that assigned code points always mean the same character. There is only one flavor of Unicode, or it wouldn’t be Unicode.
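To make the chr(233) / &#233; / &#xe9; point concrete, here is a small sketch. The helper code_point_of is a made-up name that handles only numeric character references; a real program would more likely use a proper HTML or XML parser. charnames::viacode comes from the core charnames module.

```perl
use strict;
use warnings;
use charnames ();    # core module; provides charnames::viacode

# Hypothetical helper: pull the code point out of a numeric character
# reference such as "&#233;" (decimal) or "&#xe9;" (hexadecimal).
sub code_point_of {
    my ($ref) = @_;
    return hex $1 if $ref =~ /\A&#x([0-9A-Fa-f]+);\z/;
    return $1 + 0 if $ref =~ /\A&#([0-9]+);\z/;
    return;    # not a numeric character reference
}

# Three spellings, one code point, one character name.
for my $cp (ord(chr(233)), code_point_of('&#233;'), code_point_of('&#xe9;')) {
    printf "U+%04X %s\n", $cp, charnames::viacode($cp);
}
```

All three lines of output read U+00E9 LATIN SMALL LETTER E WITH ACUTE, whichever notation the code point arrived in.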

I suppose you might stump for Unicode 6.0 being a different flavor from Unicode 5.0, but that seems to be putting too fine a point on it. In any event, the strong stability guarantees Unicode makes avoid train wrecks in that arena.

Which is quite all the time I have for a belligerent anonymous coward, and then some.

Re^10: best sort
by BrowserUk (Patriarch) on Aug 18, 2011 at 01:45 UTC
    you need various bits of metadata to know how to handle it.

    Ah. So when a Unicode file is sent somewhere, it needs to be accompanied by another file containing metadata to identify which "unicode" the first contains. But what encoding is the metadata in? Now you need another file ...

    Oh yeah! That's great design.

    a belligerent anonymous coward,

    Translations:

    • belligerent: someone who doesn't immediately agree with the VIP tchrist.
    • anonymous coward: someone you can't intimidate when you run out of good arguments.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Ah. So when a Unicode file is sent somewhere, it needs to be accompanied by another file containing metadata to identify which "unicode" the first contains. But what encoding is the metadata in? Now you need another file ...
      BZZZT!

      (And thank you for playing.)

      Precisely what part of “There is no such thing as a Unicode file.” was it that you didn’t understand?

        “There is no such thing as a Unicode file.”

        BZZZT!

        A sample: (Note well the source!)

        FAQ - UTF-8, UTF-16, UTF-32 & BOM

        unicode.org/faq/utf_bom.html

        In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples: ...

        See, even they recognise the need. Their mistake was not making it compulsory!

        What do you not understand about constructional English?

        1. A text file: a file that contains text.
        2. An image file: a file that contains an image.
        3. A video file: a file that contains video.
        4. A unicode file: a file that contains unicode.

        Now, stop being such an ass. As diversionary tactics go, it's a lousy one.

        You've still failed to answer the question I posed way back there:

        Explain how you are going to solve the problem of sorting names written in [a combination of] Latin, Cyrillic, Arabic, Farsi, Thai, Chinese, Japanese, Urdu, Gaelic, Ge'ez, Osmanya, Tifinagh ... et al.

