Re^3: Character encoding woes

in reply to Re^2: Character encoding woes - unicode or not?
in thread Character encoding woes - unicode or not?

Encode::Detect is my new best friend, thanks to Anonymous Monk.

Then you don't know it very well. It’s really rather poor. Here are the failure results on a small sampling of real-world test files:

 Status  Filename    Right         E::D::Detector
 ==================================================
 Wrong!   7843118       ascii       UNABLE TO GUESS
          9501430       utf8        UTF-8
         10318897       cp1252      windows-1252
         10329150       cp1252      windows-1252
 Wrong!  10358003       MacRoman    UNABLE TO GUESS
 Wrong!  10358042       MacRoman    UNABLE TO GUESS
 Wrong!  10429209       MacRoman    UNABLE TO GUESS
         10482611       cp1252      windows-1252
 Wrong!  10542098       MacRoman    UNABLE TO GUESS
         10617571       cp1252      windows-1252
 Wrong!  10625668       iso-8859-1  windows-1252
         10676968       cp1252      windows-1252
         10677497       cp1252      windows-1252
         10963661       MacRoman    UNABLE TO GUESS
         11042188       MacRoman    UNABLE TO GUESS
         11212329       utf8        UTF-8
         11287402       cp1252      windows-1252
 Wrong!  11470876       MacRoman    windows-1252
         11842027       iso-8859-1  windows-1252
 Wrong!  11940257       ascii       UNABLE TO GUESS
 Wrong!  11972335       MacRoman    UNABLE TO GUESS
 Wrong!  12091502       iso-8859-1  windows-1252
         12169614       utf8        UTF-8
         12495435       MacRoman    windows-1252
         12736309       MacRoman    windows-1252
         14641909       MacRoman    windows-1252
         14652344       utf8        UTF-8
         14751857       cp1252      windows-1252
         15037632       cp1252      windows-1252
         15070898       cp1252      windows-1252
 Wrong!  15154606       MacRoman    windows-1252
         15201223       cp1252      windows-1252
 Wrong!  15315962       iso-8859-1  Big5
         15328020       cp1252      windows-1252
 Wrong!  17298172       MacRoman    windows-1252
 Wrong! 116059400       MacRoman    windows-1252

I have a working snapshot of a module called Encode::Guess::Educated that actually does work right on such things. It has no non-core dependencies. It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the 8-bit encodings that conflict so badly with one another.
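
To see why the 8-bit encodings are the hard part, here is a minimal sketch using only the core Encode module. It is not code from Encode::Guess::Educated; the helper names is_ascii and is_utf8_clean are made up for illustration. ASCII and UTF-8 can be ruled in or out by strict validation, whereas the same high byte decodes without complaint under more than one 8-bit encoding, just to different characters.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string has no high bytes at all.
sub is_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: true if the bytes survive a strict UTF-8 decode.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps it
# from modifying the source buffer.
sub is_utf8_clean {
    my ($bytes) = @_;
    return eval { decode("UTF-8", $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

for my $sample ("cafe", "caf\xC3\xA9", "caf\xE9") {
    printf "%-12s ascii=%d utf8=%d\n",
        unpack("H*", $sample), is_ascii($sample), is_utf8_clean($sample);
}

# The lone byte 0xE9 is not valid UTF-8, but it is a perfectly good
# character in each of these 8-bit encodings, and a different one in each.
for my $enc (qw(cp1252 MacRoman)) {
    printf "%-8s 0xE9 => U+%04X\n", $enc, ord decode($enc, "\xE9");
}
```

Validity alone therefore cannot separate cp1252 from MacRoman; something has to judge which decoding looks more like real text.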

The reason it can do this is that it works off a training model. I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them. It makes an educated guess based on conformance to a particular model. And it does very well.
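
The general idea of scoring against a model can be sketched as follows. This is only an illustration of the technique, not the module's actual algorithm: the tiny %freq table, its numbers, the $floor value, and the function names log_score and best_guess are all invented here. Each candidate encoding is used to decode the bytes, and the non-ASCII characters that result are scored by how often the model says they should occur.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical toy model: relative frequencies of a few non-ASCII
# characters. These numbers are invented for illustration; a real model
# would be counted from a training corpus.
my %freq = (
    "\x{00E9}" => 0.30,    # LATIN SMALL LETTER E WITH ACUTE
    "\x{00FC}" => 0.20,    # LATIN SMALL LETTER U WITH DIAERESIS
    "\x{2013}" => 0.25,    # EN DASH
    "\x{00A9}" => 0.10,    # COPYRIGHT SIGN
);
my $floor = 1e-6;          # assumed probability for characters never seen

# Sum of log-probabilities of the non-ASCII characters obtained by
# decoding $bytes under $encoding.
sub log_score {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);
    my $sum  = 0;
    for my $ch ($text =~ /([^\x00-\x7F])/g) {
        $sum += log($freq{$ch} // $floor);
    }
    return $sum;
}

# Pick the candidate whose decoding conforms best to the model;
# ties are broken by name.
sub best_guess {
    my ($bytes, @candidates) = @_;
    my %score = map { $_ => log_score($bytes, $_) } @candidates;
    my ($best) = sort { $score{$b} <=> $score{$a} || $a cmp $b } @candidates;
    return $best;
}

my $bytes = "S\xFCdhof, caf\xE9, I\x96V";
print best_guess($bytes, qw(cp1252 MacRoman)), "\n";
```

Under cp1252 the three high bytes become characters the toy model knows, while under MacRoman they become characters it has never seen, so cp1252 scores higher.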

Right now it has only a CLI API and an OO API, no Export-based one. Here’s the easiest way to use the CLI API, via a simple program called gank:

$ gank 011526914.txt
cp1252

$ gank 00*.txt Sym*.txt 0115*.txt
001313968.txt: ascii
001328180.txt: utf8
007499277.txt: iso-8859-1
Symbola602.txt: UTF-16
011526914.txt: cp1252
011535589.txt: iso-8859-1
011570876.txt: MacRoman

The underlying class’s default training model derives from the complete PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts. It also does well on other texts using any Latin-based alphabet. I have comparative statistics using two alternate training models, but the PMCOA model is fine for most purposes.

You may also give gank a -s option to give you a short ‘score-card’ of the various encodings it considered:

  *91.718532 +2.285393 MacRoman
    3.640513 -0.941206 iso-8859-1, iso-8859-15, cp1252
    3.639257 -0.941552 cp1250
    1.001698 -2.231634 iso-8859-2

EXPLANATION:

The first column is all scores normalized to 0..100.
The second column is the natural log of the real score.
The rest is which encodings have that score, in order of preference for breaking ties among equal scores. I have it arranged so that it reports the smallest subset that works; i.e., ascii < latin1 < cp1252, etc.
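
The relationship between the two columns can be checked from the numbers printed above. The sketch below assumes that the first column is each real score divided by the sum of the real scores over the listed rows, times 100; that is an inference from the output, not something read out of the module, and the function name normalize is made up. Exponentiating the second column and normalizing that way reproduces the first column to within rounding.

```perl
use strict;
use warnings;

# Second-column values (natural logs of the real scores), copied from
# the score-card above. Tied encodings share one row.
my @rows = (
    [ "MacRoman",                        2.285393 ],
    [ "iso-8859-1, iso-8859-15, cp1252", -0.941206 ],
    [ "cp1250",                          -0.941552 ],
    [ "iso-8859-2",                      -2.231634 ],
);

# Assumed normalization: exponentiate each log score, then express it
# as a percentage of the total over all rows.
sub normalize {
    my @logs  = @_;
    my @raw   = map { exp($_) } @logs;
    my $total = 0;
    $total += $_ for @raw;
    return map { 100 * $_ / $total } @raw;
}

my @pct = normalize(map { $_->[1] } @rows);
for my $i (0 .. $#rows) {
    printf "%10.4f %+.6f %s\n", $pct[$i], $rows[$i][1], $rows[$i][0];
}
```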

There’s also a -l option to give you a long report that illustrates what the text would look like under each possible choice of encoding, with paired lines of literal UTF-8 and \N{...} named characters.

total bytes=15903, high bytes=22, distinct high bytes=8
  *49.582509 +0.909655 cp1252
      => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
      => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
   49.557280 +0.909146 cp1250
      => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
      => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
    0.860211 -3.144560 MacRoman
      => "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
      => "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"
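
Paired lines like these can be produced with nothing but the core Encode and charnames modules. The following is a minimal sketch of that idea, not gank's own code; the function name named and the sample bytes (taken from two words in the report above) are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: replace every non-ASCII character with \N{NAME},
# falling back to U+XXXX for code points that have no name.
sub named {
    my ($text) = @_;
    return join "", map {
        my $cp = ord;
        $cp < 0x80
            ? $_
            : sprintf('\N{%s}', charnames::viacode($cp) // sprintf('U+%04X', $cp));
    } split //, $text;
}

binmode(STDOUT, ":encoding(UTF-8)");

# The same bytes, interpreted under three candidate encodings.
my $bytes = "S\xFCdhof Marqu\xE8ze";
for my $enc (qw(cp1252 cp1250 MacRoman)) {
    my $text = decode($enc, $bytes);
    print "$enc\n";
    print qq(  => "$text"\n);
    print qq(  => "), named($text), qq("\n);
}
```

Seeing the named characters side by side makes it obvious to a human reader which interpretation is plausible, even when the scores are close.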

I need to do more work on its API and, of course, document it. This is just a proof of concept, although it does come with a halfway decent test suite. But I’m hunkered down right now correcting page proofs on Camel4, so I probably won’t get to sprucing up the module for another 7–10 days.

--tom

In Section Seekers of Perl Wisdom