Character encoding woes - unicode or not?

by japhy (Canon)
on Jan 31, 2012 at 14:24 UTC

japhy has asked for the wisdom of the Perl Monks concerning the following question:

I am working with WHOIS servers and encountering what I believe to be a character-encoding issue. Specifically, one WHOIS server returns properly encoded UTF-8 text (I think) and another does not: the first returns the ™ character as three high-bit bytes (the sequence e2 84 a2), while the second returns accented characters like ç and á as single bytes (e7 and e1).

This inconsistency means that when I display the text in a browser window (charset=utf-8), the ™ character from whois.markmonitor.com appears correctly as ™, but the accented characters from whois.registro.br appear as the dreaded black diamond with a question mark (�).
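
For concreteness, a quick check with core Encode reproduces the two byte patterns (this snippet is only an illustration):

use Encode qw(encode);

# The TRADE MARK SIGN encodes to three bytes in UTF-8, whereas Latin-1
# represents c-cedilla and a-acute as one byte each.
printf "%vx\n", encode('UTF-8',  "\x{2122}");      # e2.84.a2
printf "%vx\n", encode('latin1', "\x{E7}\x{E1}");  # e7.e1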

What is the best way to 1) detect high-bit characters that are not part of a properly encoded UTF-8 sequence, and 2) "upgrade" those characters to a properly encoded UTF-8 sequence?

Jeffrey Pinyan (Perl, PHP ugh, JavaScript) — @PrayingTheMass
Melius servire volo
Catholic Liturgy

Replies are listed 'Best First'.
Re: Character encoding woes - unicode or not?
by Anonymous Monk on Jan 31, 2012 at 14:32 UTC
    Simply try decoding with several encodings in order, and catch the exception if one fails.

    Encode::decode($encoding, $octets, Encode::FB_CROAK)

    http://stackoverflow.com/a/1974459
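
    A minimal sketch of that try-in-order approach (the sub name and the particular list of fallback encodings are only illustrative):

    use Encode qw(decode);

    sub decode_guess {
        my $octets = shift;
        for my $enc ('UTF-8', 'cp1252', 'iso-8859-1') {
            my $copy = $octets;   # work on a copy; some CHECK modes modify the input in place
            my $text = eval { decode($enc, $copy, Encode::FB_CROAK) };
            return $text if defined $text;
        }
        return $octets;           # nothing matched; hand back the raw bytes unchanged
    }

    Since iso-8859-1 can decode any byte sequence, the loop as written always succeeds by the last iteration; the final return only matters if you drop that catch-all from the list.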

        Encode::Detect is my new best friend, thanks to Anonymous Monk.
        Then you don't know it very well. It’s really rather poor. Here are the failure results on a small sampling of real-world test files:
         Status   Filename   Right       E::D::Detector
        ==================================================
        Wrong!     7843118   ascii       UNABLE TO GUESS
                   9501430   utf8        UTF-8
                  10318897   cp1252      windows-1252
                  10329150   cp1252      windows-1252
        Wrong!    10358003   MacRoman    UNABLE TO GUESS
        Wrong!    10358042   MacRoman    UNABLE TO GUESS
        Wrong!    10429209   MacRoman    UNABLE TO GUESS
                  10482611   cp1252      windows-1252
        Wrong!    10542098   MacRoman    UNABLE TO GUESS
                  10617571   cp1252      windows-1252
        Wrong!    10625668   iso-8859-1  windows-1252
                  10676968   cp1252      windows-1252
                  10677497   cp1252      windows-1252
                  10963661   MacRoman    UNABLE TO GUESS
                  11042188   macRoman    UNABLE TO GUESS
                  11212329   utf8        UTF-8
                  11287402   cp1252      windows-1252
        Wrong!    11470876   MacRoman    windows-1252
                  11842027   iso-8859-1  windows-1252
        Wrong!    11940257   ascii       UNABLE TO GUESS
        Wrong!    11972335   MacRoman    UNABLE TO GUESS
        Wrong!    12091502   iso-8859-1  windows-1252
                  12169614   utf8        UTF-8
                  12495435   MacRoman    windows-1252
                  12736309   MacRoman    windows-1252
                  14641909   MacRoman    windows-1252
                  14652344   utf8        UTF-8
                  14751857   cp1252      windows-1252
                  15037632   cp1252      windows-1252
                  15070898   cp1252      windows-1252
        Wrong!    15154606   MacRoman    windows-1252
                  15201223   cp1252      windows-1252
        Wrong!    15315962   iso-8859-1  Big5
                  15328020   cp1252      windows-1252
        Wrong!    17298172   MacRoman    windows-1252
        Wrong!   116059400   MacRoman    windows-1252
        I have a working snapshot of a module, Encode::Guess::Educated, that actually does work right on such things. It has no noncore dependencies. It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the various conflicting 8-bit encodings.

        The reason it can do this is that it works off a training model. I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them. It makes an educated guess based on conformance to a particular model. And it does very well.
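
        (To make the idea concrete, here is a toy sketch of the conformance-scoring approach. It is emphatically not Encode::Guess::Educated: the log-probabilities are made-up stand-ins for a real trained model, and it only scores single codepoints.)

        use strict;
        use warnings;
        use Encode qw(decode);

        # Made-up per-codepoint log-probabilities standing in for a trained model.
        my %logprob = (
            "\x{E8}"  => -4.5,    # e-grave: plausible in Latin-script running text
            "\x{E1}"  => -4.2,    # a-acute
            "\x{CB}"  => -9.8,    # E-diaeresis: much rarer
            "\x{B7}"  => -10.5,   # middle dot (what MacRoman makes of byte 0xE1)
            "\x{10D}" => -9.0,    # c-caron
        );
        my $floor = -15;          # penalty for codepoints the model has never seen

        sub score_encoding {
            my ($octets, $encoding) = @_;
            my $text = eval { decode($encoding, $octets, Encode::FB_CROAK) };
            return undef unless defined $text;   # impossible under this encoding
            my $score = 0;
            $score += $logprob{$_} // $floor
                for grep { ord() > 0x7F } split //, $text;
            return $score;
        }

        my $octets = "Marqu\xE8ze, Llin\xE1s";   # raw bytes of unknown encoding
        for my $enc ('UTF-8', 'cp1252', 'MacRoman', 'iso-8859-2') {
            my $score = score_encoding($octets, $enc);
            printf "%-10s %s\n", $enc, defined $score ? $score : 'cannot decode';
        }

        In this toy run cp1252 gets the least-negative score, because its decoding yields the codepoints the (made-up) model considers most plausible; a real model trained on millions of codepoints does the same thing, only with far better numbers.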

        Right now it has only a CLI API and an OO API, no Export-based one. Here’s the easiest way to use the CLI API, via a simple program called gank:

        $ gank 011526914.txt
        cp1252

        $ gank 00*.txt Sym*.txt 0115*.txt
        001313968.txt: ascii
        001328180.txt: utf8
        007499277.txt: iso-8859-1
        Symbola602.txt: UTF-16
        011526914.txt: cp1252
        011535589.txt: iso-8859-1
        011570876.txt: MacRoman
        The underlying class’s default training model derives from the complete PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts. It also does well on other texts using any Latin-based alphabet. I have comparative statistics using two alternate training models, but the PMCOA model is fine for most purposes.

        You may also give gank a -s option to get a short ‘score-card’ of the various encodings it considered:

        *91.718532  +2.285393  MacRoman
          3.640513  -0.941206  iso-8859-1, iso-8859-15, cp1252
          3.639257  -0.941552  cp1250
          1.001698  -2.231634  iso-8859-2
        EXPLANATION:
        • The first column is the score, normalized to 0..100.
        • The second column is the natural log of the real score.
        • The rest lists which encodings have that score, in order of preference for breaking ties between equal scores. I have it arranged so that it reports the smallest subset that works; i.e., ascii < latin1 < cp1252, etc.
        There’s also a -l option to give you a long report that illustrates what each possible choice would look like if the text were in that encoding, with paired lines of literal UTF-8 and \N{...} named characters.
        total bytes=15903, high bytes=22, distinct high bytes=8
          *49.582509 +0.909655 cp1252
              => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
              => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
           49.557280 +0.909146 cp1250
              => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
              => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
            0.860211 -3.144560 MacRoman
              => "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
              => "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"
        
        I need to do more work on its API — this is just a proof of concept, although it does come with a halfway decent test suite — and of course document it, but I’m hunkered down right now correcting page-proofs on Camel4, so I probably won’t get to sprucing up the module for another 7–10 days.

        --tom

Re: Character encoding woes - unicode or not?
by repellent (Priest) on Feb 01, 2012 at 07:04 UTC
    use Encode qw(decode);

    sub decode_it {
        my $s = shift;
        eval {
            $s = decode('UTF-8', $s, 1);
            1;
        } or do {
            $s = decode('latin1', $s, 1);
        };
        return $s;
    }

    use Devel::Peek qw(Dump);
    Dump decode_it($_) for "\xE2\x84\xA2", "\xE7\xE1";

    __END__
    SV = PV(0x100820708) at 0x10081c248
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x100202140 "\342\204\242"\0 [UTF8 "\x{2122}"]
      CUR = 3
      LEN = 8
    SV = PV(0x100820748) at 0x100860ee0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x10025dec0 "\303\247\303\241"\0 [UTF8 "\x{e7}\x{e1}"]
      CUR = 4
      LEN = 8

    Once decoded by decode_it, your character strings are ready to be UTF-8 encoded right before you put them out onto your web page.
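
    For example (illustrative only; $whois_text here stands in for the raw octets read from the server):

    use Encode qw(encode);

    my $chars = decode_it($whois_text);    # Perl character string
    print encode('UTF-8', $chars);         # octets suitable for a charset=utf-8 page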
