I am working with WHOIS servers and encountering what I believe to be a character-encoding issue. One WHOIS server returns properly encoded UTF-8 text (I think), and another does not: the first returns the ™ character as three high-bit bytes (the sequence e2 84 a2), while the second returns accented characters like ç and á as single bytes (e7 and e1), which looks like ISO-8859-1 (Latin-1).
This inconsistency means that when I display this text in a browser window (charset=utf-8), the ™ character from whois.markmonitor.com appears correctly ™, but the accented characters from whois.registro.br appear as the dreaded black diamond with a question mark �.
What is the best way to 1) detect high-bit bytes that are not part of a valid UTF-8 sequence, and 2) "upgrade" those bytes to properly encoded UTF-8?
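To make the two steps concrete, here is a minimal sketch in Python (not the language of this forum, just a convenient one for illustration). It assumes the stray high-bit bytes are ISO-8859-1, which matches e7 = ç and e1 = á but is a guess about what whois.registro.br actually sends. The names `latin1_fallback` and `upgrade_to_utf8` are made up for this example. The idea is to decode as strict UTF-8 and, only where a byte run is not valid UTF-8, reinterpret those bytes as Latin-1.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: reinterpret bytes that are not valid UTF-8 as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]   # the offending byte(s)
    # Latin-1 maps every byte 0x00-0xFF to the code point of the same value,
    # so this decode cannot fail. Resume decoding right after the bad bytes.
    return bad.decode("latin-1"), err.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("latin1_fallback", _latin1_fallback)


def upgrade_to_utf8(raw: bytes) -> str:
    """Decode valid UTF-8 as-is; treat any other high-bit bytes as Latin-1."""
    return raw.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    samples = [
        b"Brand\xe2\x84\xa2",      # already valid UTF-8
        b"informa\xe7\xe3o",       # Latin-1 bytes, invalid as UTF-8
    ]
    for raw in samples:
        text = upgrade_to_utf8(raw)
        print(text, text.encode("utf-8"))
```

Calling `.encode("utf-8")` on the result gives bytes that are safe to send to a page declared as charset=utf-8. One caveat: a Latin-1 byte pair that happens to form a valid UTF-8 sequence (for example c3 a9) would be read as UTF-8, so if a server is known to always send Latin-1, decoding its whole response as Latin-1 is the safer choice.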
In reply to Character encoding woes - unicode or not? by japhy