Re: Unicode nightmare

(1) Make sure you don't lose any metadata that comes with the text (e.g. the charset parameter in a MIME Content-Type header).
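As a rough illustration of point (1), here is a minimal Python sketch that reads the declared charset from a raw MIME message using the standard library's email package. The function name declared_charset and the sample message are made up for this example; the point is only that the label should be captured before the body is passed along.

```python
from email import message_from_bytes


def declared_charset(raw_message: bytes):
    """Return the charset parameter of Content-Type, or None if absent.

    Hypothetical helper for illustration; get_content_charset() is the
    standard-library call and returns the value lower-cased.
    """
    msg = message_from_bytes(raw_message)
    return msg.get_content_charset()


# Made-up sample message with an explicit charset label.
sample = b"Content-Type: text/plain; charset=ISO-2022-JP\r\n\r\nbody text"
print(declared_charset(sample))
```

If this returns None, the text arrived without a label and you are back to guessing, which is what point (2) is about.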

(2) If your text includes the ESCAPE character, it may contain ISO-2022 shift sequences that identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! Note that the registry writes bytes in column/row notation, so when ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course "\x24\x2A\x45" are the characters $ * E.)
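The column/row notation used in the registry PDFs converts mechanically: a token c/r stands for the byte c*16 + r. A small Python sketch of that conversion follows; parse_colrow is a hypothetical helper name, and it assumes tokens are separated by whitespace as in the example above.

```python
def parse_colrow(notation: str) -> bytes:
    """Convert registry notation such as 'ESC 2/4 2/10 4/5' to raw bytes.

    Hypothetical helper: each 'c/r' token is column c, row r of the code
    table, i.e. the byte value c * 16 + r. 'ESC' is the byte 0x1B.
    """
    out = bytearray()
    for token in notation.split():
        if token.upper() == "ESC":
            out.append(0x1B)
            continue
        col, row = token.split("/")
        out.append(int(col) * 16 + int(row))
    return bytes(out)


# The ISO-IR 165 G2 designation quoted above.
print(parse_colrow("ESC 2/4 2/10 4/5"))  # b'\x1b$*E'
```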

You don't have to understand G0, G1 and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know if Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.
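Recognizing the sequences without decoding them can be sketched in a few lines of Python. This relies on the general shape of an escape sequence in ISO-2022 / Ecma-35: ESC, then zero or more intermediate bytes in the range 0x20-0x2F, then one final byte in 0x30-0x7E. The KNOWN table is a tiny illustrative subset (the ISO-2022-JP designations from RFC 1468 plus the ISO-IR 165 example above), not a complete registry, and find_designations is a hypothetical name.

```python
import re

# ESC, any number of intermediate bytes (0x20-0x2F), one final byte (0x30-0x7E).
ESCAPE_RE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")

# Illustrative subset only; the real registry has a couple of hundred entries.
KNOWN = {
    b"\x1b(B": "ASCII (G0)",
    b"\x1b(J": "JIS X 0201 Roman (G0)",
    b"\x1b$@": "JIS X 0208-1978 (G0)",
    b"\x1b$B": "JIS X 0208-1983 (G0)",
    b"\x1b$*E": "ISO-IR 165 (G2)",
}


def find_designations(data: bytes):
    """List the distinct escape sequences in data, in order of first use.

    Sequences missing from KNOWN (including shifts such as ESC N) are
    reported as 'unknown' rather than guessed at.
    """
    seen = []
    for match in ESCAPE_RE.finditer(data):
        seq = match.group(0)
        if seq not in seen:
            seen.append(seq)
    return [(seq, KNOWN.get(seq, "unknown")) for seq in seen]


# ISO-2022-JP style sample: switch to JIS X 0208, three characters, back to ASCII.
sample = b"abc \x1b$BF|K\\8l\x1b(B xyz"
for seq, name in find_designations(sample):
    print(seq, name)
```

This only tells you which character sets are designated; mapping the bytes that follow to Unicode still requires tracking G0 to G3 and the shift state.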

Some general character set links:

ICU - International Components for Unicode in C and Java. It has extremely useful data (in the "data" subdirectory) even if you are not planning to use the code.
http://oss.software.ibm.com/icu/
ISO/IEC International Register of Coded Character Sets To Be Used With Escape Sequences *groan*
http://www.itscj.ipsj.or.jp/ISO-IR/
IANA character set registry
http://www.iana.org/assignments/character-sets
Ecma Standards. E.g. Ecma-35 is the same standard as ISO-2022, but it's free!
http://www.ecma-international.org/publications/standards/Standard.htm
Character Model for the World Wide Web
http://www.w3.org/TR/charmod/
Unicode
http://www.unicode.org/
especially
http://www.unicode.org/Public/UNIDATA/
Roman Czyborra's informative web site
http://czyborra.com/
