Simply try decoding with several encodings in order, and catch the exception whenever one fails.
Encode::decode($encoding, $octets, Encode::FB_CROAK)
http://stackoverflow.com/a/1974459
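A minimal sketch of that try-in-order approach, using only the core Encode module. The helper name decode_first_match and the candidate list are assumptions made for illustration. Single-byte encodings such as iso-8859-1 accept any byte sequence, so they belong at the end of the list.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try each candidate encoding in order and return
# (decoded string, encoding name) for the first one that decodes cleanly.
sub decode_first_match {
    my ($octets, @candidates) = @_;
    for my $encoding (@candidates) {
        my $text = eval {
            # FB_CROAK makes decode() die on malformed input;
            # LEAVE_SRC keeps $octets intact for the next attempt.
            decode($encoding, $octets, FB_CROAK | LEAVE_SRC);
        };
        return ($text, $encoding) if defined $text;
    }
    return;    # no candidate decoded cleanly
}

# Order matters: strict encodings first, permissive ones last.
my ($text, $used) = decode_first_match("\xE7\xE1", 'UTF-8', 'iso-8859-1');
print "decoded as $used\n";
```

This only tells you which encodings are possible, not which one is right: two 8-bit encodings will both decode the same bytes without complaint.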
Encode::Detect is my new best friend, thanks to Anonymous Monk.
Encode::Detect is my new best friend, thanks to Anonymous Monk.
Then you don't know it very well. It’s really rather poor. Here are the failure results on a small sampling of real-world test files:
Status  Filename   Right       E::D::Detector
==================================================
Wrong!  7843118    ascii       UNABLE TO GUESS
        9501430    utf8        UTF-8
        10318897   cp1252      windows-1252
        10329150   cp1252      windows-1252
Wrong!  10358003   MacRoman    UNABLE TO GUESS
Wrong!  10358042   MacRoman    UNABLE TO GUESS
Wrong!  10429209   MacRoman    UNABLE TO GUESS
        10482611   cp1252      windows-1252
Wrong!  10542098   MacRoman    UNABLE TO GUESS
        10617571   cp1252      windows-1252
Wrong!  10625668   iso-8859-1  windows-1252
        10676968   cp1252      windows-1252
        10677497   cp1252      windows-1252
Wrong!  10963661   MacRoman    UNABLE TO GUESS
Wrong!  11042188   MacRoman    UNABLE TO GUESS
        11212329   utf8        UTF-8
        11287402   cp1252      windows-1252
Wrong!  11470876   MacRoman    windows-1252
Wrong!  11842027   iso-8859-1  windows-1252
Wrong!  11940257   ascii       UNABLE TO GUESS
Wrong!  11972335   MacRoman    UNABLE TO GUESS
Wrong!  12091502   iso-8859-1  windows-1252
        12169614   utf8        UTF-8
Wrong!  12495435   MacRoman    windows-1252
Wrong!  12736309   MacRoman    windows-1252
Wrong!  14641909   MacRoman    windows-1252
        14652344   utf8        UTF-8
        14751857   cp1252      windows-1252
        15037632   cp1252      windows-1252
        15070898   cp1252      windows-1252
Wrong!  15154606   MacRoman    windows-1252
        15201223   cp1252      windows-1252
Wrong!  15315962   iso-8859-1  Big5
        15328020   cp1252      windows-1252
Wrong!  17298172   MacRoman    windows-1252
Wrong!  116059400  MacRoman    windows-1252
I have a working snapshot of a module called Encode::Guess::Educated that actually does work right on such things. It has no non-core dependencies.
It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the very conflicting 8-bit encodings.
The reason it can do this is that it works off a training model. I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them.
It makes an educated guess based on conformance to a particular model. And it does very well.
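To make "conformance to a model" concrete, here is a toy sketch of the general idea. It is not the module's code: the tiny %MODEL table, its made-up probabilities, the $UNSEEN floor, and both function names are invented for illustration, whereas a real model would be trained on millions of codepoints.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Toy model (made-up numbers): log-probability of seeing each non-ASCII
# codepoint in English-language text.
my %MODEL = (
    0x2013 => log(0.30),    # EN DASH
    0x00E9 => log(0.20),    # LATIN SMALL LETTER E WITH ACUTE
    0x00FC => log(0.10),    # LATIN SMALL LETTER U WITH DIAERESIS
    0x00A9 => log(0.05),    # COPYRIGHT SIGN
);
my $UNSEEN = log(1e-6);     # floor for codepoints the model never saw

# Score one candidate: decode, then sum the log-probabilities of the
# non-ASCII codepoints. Returns undef if the bytes are invalid.
sub score_encoding {
    my ($octets, $encoding) = @_;
    my $text = eval { decode($encoding, $octets, FB_CROAK | LEAVE_SRC) };
    return unless defined $text;
    my $score = 0;
    for my $cp (grep { $_ > 0x7F } map { ord($_) } split //, $text) {
        $score += $MODEL{$cp} // $UNSEEN;
    }
    return $score;
}

# Pick the candidate whose decoding best fits the model.
sub best_guess {
    my ($octets, @candidates) = @_;
    my ($best, $best_score);
    for my $encoding (@candidates) {
        my $score = score_encoding($octets, $encoding);
        next unless defined $score;
        ($best, $best_score) = ($encoding, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Byte 0x96 is an en dash in cp1252 but a small n with tilde in MacRoman.
print best_guess("I\x96V caf\xE9", 'cp1252', 'MacRoman'), "\n";
```

Both encodings decode these bytes without error, so only the model can separate them: the cp1252 reading produces characters the model expects, and the MacRoman reading does not.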
Right now it has only a CLI API and an OO API, no Export-based one. Here’s the easiest way to use the CLI API, via
a simple program called gank:
$ gank 011526914.txt
cp1252
$ gank 00*.txt Sym*.txt 0115*.txt
001313968.txt: ascii
001328180.txt: utf8
007499277.txt: iso-8859-1
Symbola602.txt: UTF-16
011526914.txt: cp1252
011535589.txt: iso-8859-1
011570876.txt: MacRoman
The underlying class’s default training model derives from the complete
PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts.
It also does well on other texts using any Latin-based alphabet. I have
comparative statistics using two alternate training models, but
the PMCOA model is fine for most purposes.
You may also give gank a -s option to give you a short ‘score-card’ of the
various encodings it considered:
*91.718532 +2.285393 MacRoman
3.640513 -0.941206 iso-8859-1, iso-8859-15, cp1252
3.639257 -0.941552 cp1250
1.001698 -2.231634 iso-8859-2
EXPLANATION:
- The first column is all scores normalized to 0..100.
- The second column is the natural log of the real score.
- The rest is which encodings have that score, and in the
order of preference for breaking ties of the same score.
I have it arranged so it says it’s the smallest subset
that works; i.e., ascii < latin1 < cp1252, etc.
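The relationship between the two columns can be checked by hand. The sketch below assumes the first column is each raw score's share of the total of the four rows shown; that reading is consistent with the numbers above, but it is an inference from the output, not the module's code, and the function name normalize is made up.

```perl
use strict;
use warnings;

# Log scores copied from the second column of the score-card above.
my @rows = (
    [ 'MacRoman',                         2.285393 ],
    [ 'iso-8859-1, iso-8859-15, cp1252', -0.941206 ],
    [ 'cp1250',                          -0.941552 ],
    [ 'iso-8859-2',                      -2.231634 ],
);

# Assumed normalization: undo the natural log, then scale so that
# all raw scores together sum to 100.
sub normalize {
    my @log_scores = @_;
    my @raw   = map { exp($_) } @log_scores;
    my $total = 0;
    $total += $_ for @raw;
    return map { 100 * $_ / $total } @raw;
}

my @percent = normalize(map { $_->[1] } @rows);
printf "%10.6f %+.6f %s\n", $percent[$_], $rows[$_][1], $rows[$_][0]
    for 0 .. $#rows;
```

Run against those four log scores, this reproduces the first column to within rounding of the six printed decimals.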
There’s also a -l option to give you a long report that illustrates
what each possible choice would be if it were in that encoding,
with paired lines of literal UTF-8 and \N{...} named characters.
total bytes=15903, high bytes=22, distinct high bytes=8
*49.582509 +0.909655 cp1252
=> "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
=> "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
49.557280 +0.909146 cp1250
=> "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
=> "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
0.860211 -3.144560 MacRoman
=> "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
=> "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"
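Paired lines like these are easy to produce with core modules. The following is a minimal sketch, not gank's own code; name_of and with_names are hypothetical helper names, and the sample bytes are taken from the word "Südhof" in the report above.

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(decode);

# Render one character as \N{NAME}, falling back to \x{HEX}
# when Unicode has no name for it.
sub name_of {
    my ($char) = @_;
    my $name = charnames::viacode(ord $char);
    return defined $name
        ? '\\N{' . $name . '}'
        : sprintf('\\x{%X}', ord $char);
}

# Replace every non-ASCII character in a decoded string with its name.
sub with_names {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/name_of($1)/ge;
    return $text;
}

# The same octets, interpreted under two candidate encodings.
my $octets = "S\xFCdhof";
for my $encoding ('cp1252', 'MacRoman') {
    print "$encoding => ", with_names(decode($encoding, $octets)), "\n";
}
```

Spelling out the names makes a wrong guess obvious even on a terminal that cannot display the characters themselves.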
I need to do more work on its API (this is just a proof of concept, although it does come with a halfway decent test suite) and of course document it,
but I’m hunkered down right now correcting page-proofs on Camel4,
so I probably won’t get to sprucing up the module for another 7–10 days.
--tom
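use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Try strict UTF-8 first; if the octets are not valid UTF-8,
# fall back to Latin-1, which accepts any byte sequence.
sub decode_it {
    my $s = shift;
    eval {
        # FB_CROAK (the same as passing 1) dies on malformed input;
        # LEAVE_SRC keeps decode() from modifying $s in place.
        $s = decode('UTF-8', $s, FB_CROAK | LEAVE_SRC);
        1;
    } or do {
        $s = decode('latin1', $s);
    };
    return $s;
}

use Devel::Peek qw(Dump);
Dump decode_it($_) for "\xE2\x84\xA2", "\xE7\xE1";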
__END__
SV = PV(0x100820708) at 0x10081c248
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x100202140 "\342\204\242"\0 [UTF8 "\x{2122}"]
CUR = 3
LEN = 8
SV = PV(0x100820748) at 0x100860ee0
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x10025dec0 "\303\247\303\241"\0 [UTF8 "\x{e7}\x{e1}"]
CUR = 4
LEN = 8
Once decoded by decode_it, your character strings are ready to be UTF-8 encoded right before you put them out onto your web page.
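A minimal sketch of that last step, assuming the decoded text is in $text (the variable names are illustrative only). You can either encode explicitly or let an I/O layer do it for you.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string, such as decode_it() would return.
my $text = "\x{2122} \x{e7}\x{e1}";

# Option 1: encode explicitly, giving octets you can write anywhere.
my $octets = encode('UTF-8', $text);

# Option 2: push an encoding layer so every print is encoded for you.
binmode(STDOUT, ':encoding(UTF-8)');
print $text, "\n";
```

Use one option or the other for a given filehandle; printing already-encoded octets through an encoding layer would encode them twice.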