japhy has asked for the wisdom of the Perl Monks concerning the following question:
I am working with WHOIS servers and have run into what I believe is a character-encoding issue. One WHOIS server returns properly encoded UTF-8 text (I think), and another does not. The first returns the ™ character as three high-bit bytes (the sequence e2 84 a2), while the second returns accented characters like ç and á as single bytes (e7 and e1), which looks like ISO-8859-1 (Latin-1).
This inconsistency means that when I display this text in a browser window (charset=utf-8), the ™ character from whois.markmonitor.com appears correctly as ™, but the accented characters from whois.registro.br appear as the dreaded black diamond with a question mark �.
What is the best way to 1) detect high-bit bytes that are not part of a properly encoded UTF-8 sequence, and 2) "upgrade" those bytes to a properly encoded UTF-8 sequence?
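To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, using the core Encode module. The helper name `to_utf8_bytes` is made up for illustration, and the fallback to ISO-8859-1 is an assumption based on the bytes I am seeing (e7, e1), not something the server declares. It tries a strict UTF-8 decode first and only treats the response as Latin-1 if that fails:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw bytes from a WHOIS response and returns
# UTF-8 encoded bytes. Assumes input is either valid UTF-8 already or,
# failing that, ISO-8859-1 (an assumption, not a detected fact).
sub to_utf8_bytes {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK dies on any malformed UTF-8 sequence;
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Not valid UTF-8, so fall back to the assumed legacy encoding.
    $text = decode('ISO-8859-1', $bytes) unless defined $text;

    return encode('UTF-8', $text);
}

# Already valid UTF-8 passes through unchanged; lone high-bit bytes
# are reinterpreted as Latin-1 and re-encoded.
my $tm      = to_utf8_bytes("\xe2\x84\xa2");   # bytes for the trademark sign
my $accents = to_utf8_bytes("\xe7\xe1");       # Latin-1 bytes e7 e1
```

This works on the response as a whole, so it would not help with text that mixes valid UTF-8 and Latin-1 bytes in the same string; that case would need the per-sequence detection I am asking about in (1).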