in reply to Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
in thread What's the best way to detect character encodings, Windows-1252 v. UTF-8?

Thank you very much, ikegami.

Unless it's valid US-ASCII, in which case it doesn't matter if you use Windows-1252 or UTF-8.

Yep. Any purely ASCII text files will simply get a UTF-8 byte order mark prefixed to them, forcing them into Unicode goodness.

EBCDIC text files will be blown to smithereens. In the context of what I'm doing, I don't care.
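
For what it's worth, here's a rough sketch in Perl of the BOM-prefixing step I described above (the sub name and the slurp-and-rewrite approach are just illustrative, not my actual script):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative helper: prepend a UTF-8 BOM to a file, but only if
    # the file is purely ASCII (no bytes outside 0x00-0x7F).
    sub add_bom_if_ascii {
        my ($path) = @_;

        open my $in, '<:raw', $path or die "Can't read $path: $!";
        my $bytes = do { local $/; <$in> };
        close $in;

        # Leave anything that isn't purely ASCII alone.
        return if $bytes =~ /[^\x00-\x7F]/;

        open my $out, '>:raw', $path or die "Can't write $path: $!";
        print $out "\xEF\xBB\xBF", $bytes;
        close $out;
    }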

Jim


Re^3: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by ikegami (Patriarch) on Jun 17, 2011 at 16:09 UTC
    • A purely US-ASCII text file cannot contain a Unicode BOM.
    • A BOM doesn't force Unicode goodness, whatever that means.
    • I don't know why you bring up EBCDIC. You said only Windows-1252 and UTF-8 are possible.

    I changed the wording of the text you quoted in the hopes of being clearer.

      Uh, I was writing whimsically and lightheartedly. (My goodness, you can find fault and contention in the most innocuous and innocent places, ikegami.)

      I know an ASCII text file cannot contain a Unicode BOM. The whole point of what I'm doing is to convert all the text files to Unicode if they aren't Unicode already. A purely ASCII text file is also a Unicode text file, just as it is also a text file in almost all other character encodings (but not EBCDIC, for example). So I'm going to add a BOM to all purely ASCII text files to make them not purely ASCII text files anymore. I'm doing this because, for better or worse, the world is now full of software that requires Unicode and is insistent that the Unicode-ness be unequivocal (i.e., that the text includes a BOM).
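
      Roughly, the conversion step could look something like this sketch, assuming (as above) that the only possible inputs are Windows-1252 and UTF-8; the sub name is mine, not anything from an existing module:

        use strict;
        use warnings;
        use Encode qw(decode encode FB_CROAK);

        # Take raw bytes, decide whether they're valid UTF-8, and return
        # UTF-8 bytes with a single leading BOM either way.
        sub to_utf8_with_bom {
            my ($bytes) = @_;

            # Try strict UTF-8 first; decode() can modify its source on
            # failure, so work on a copy.
            my $copy = $bytes;
            my $text = eval { decode('UTF-8', $copy, FB_CROAK) };

            # Anything that isn't valid UTF-8 is treated as Windows-1252.
            $text = decode('Windows-1252', $bytes) unless defined $text;

            # Strip any BOM that's already there, then put exactly one back.
            $text =~ s/^\x{FEFF}//;
            return encode('UTF-8', "\x{FEFF}" . $text);
        }

      The key point is the order: try the stricter encoding first, since almost any byte sequence is "valid" Windows-1252.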

      I mentioned EBCDIC as a lark. Smile, would ya! :-)

      Thank you again for your help.

      Jim