perlquestion
japhy
<p>I am working with WHOIS servers and running into what I believe is a character-encoding issue. One particular WHOIS server returns properly encoded UTF-8 text (I think), and another does not: the first returns the ™ character as three high-bit bytes (the sequence <code>e2 84 a2</code>), while the second returns accented characters like ç and á as single bytes (<code>e7</code> and <code>e1</code>).</p>
<p>Because of this inconsistency, when I display the text in a browser window (charset=utf-8), the ™ character from whois.markmonitor.com appears correctly (™), but the accented characters from whois.registro.br appear as the dreaded black diamond with a question mark (�).</p>
<p>What is the best way to 1) detect high-bit bytes that are not part of a properly encoded UTF-8 sequence, and 2) "upgrade" those bytes to a properly encoded UTF-8 sequence?</p>
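<p>To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, using the core Encode module. It tries a strict UTF-8 decode first and falls back to a single-byte encoding if that fails. The function name <code>to_utf8_bytes</code> is made up for illustration, and the ISO-8859-1 fallback is only an assumption: the bytes <code>e7</code> and <code>e1</code> are consistent with Latin-1, but I have not confirmed what whois.registro.br actually sends.</p>

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: takes raw bytes from a WHOIS response and
# returns UTF-8 encoded bytes suitable for a charset=utf-8 page.
sub to_utf8_bytes {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK makes decode() die on any malformed
    # sequence; LEAVE_SRC keeps $bytes from being modified in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };

    # Assumption: input that is not valid UTF-8 is ISO-8859-1.
    $text = decode('ISO-8859-1', $bytes) unless defined $text;

    return encode('UTF-8', $text);
}
```

<p>This treats each response as all-or-nothing: a response that mixes both encodings would be decoded entirely as Latin-1. A short Latin-1 string can also happen to be valid UTF-8 by coincidence, so it is a heuristic rather than a guarantee.</p>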
<!-- Node text goes above. Div tags should contain sig only -->
<div class="pmsig"><div class="pmsig-1936">
Jeffrey Pinyan (Perl, PHP [ugh], JavaScript) — <a href="http://twitter.com/PrayingTheMass">@PrayingTheMass</a><br/>
<a href="http://www.catholiccrossreference.com/"><i>Melius servire volo</i></a><br/>
<a href="http://www.prayingthemass.com/">Catholic Liturgy</a>
</div></div>