in reply to How best to avoid mojibake, when attempting to automatically convert documents to utf-8?

> ...and accurately detect their encoding/charset, and reliably convert them to utf-8.
While you can sometimes do a good job, it can't be done reliably; encoding detection is a rescue/emergency tactic for dealing with broken data. Different character sets overlap, sometimes heavily, in the byte sequences they allow, so the same bytes can be valid in several encodings at once, and a single byte of garbage can wreck detection of an otherwise obvious, valid guess. The modules you list are the way to go, but the two descriptions of this problem you've posted make it feel like an XY problem.
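To make the overlap concrete, here is a minimal sketch using Encode::Guess (one of the standard detection modules on CPAN). The bytes chosen are an assumption for illustration: `"caf\xC3\xA9"` is simultaneously valid UTF-8 ("café") and valid Latin-1 ("cafÃ©"), so no detector can disambiguate them from the bytes alone; `guess_encoding` returns an encoding object when it settles on one answer and a diagnostic string when the suspects are ambiguous.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

# Bytes that decode cleanly as both UTF-8 ("caf\x{e9}") and
# Latin-1 ("caf\x{c3}\x{a9}") -- inherently ambiguous input.
my $bytes = "caf\xC3\xA9";

# utf8 (and ascii) are always among the candidates Encode::Guess
# checks; latin1 is added as an extra suspect here.
my $guess = guess_encoding($bytes, qw/latin1/);

if ( ref $guess ) {
    # An Encode::Encoding object: detection settled on one answer.
    printf "guessed: %s\n", $guess->name;
}
else {
    # A plain diagnostic string: the suspects were ambiguous
    # (or none matched).
    print "ambiguous: $guess\n";
}
```

Either branch can fire depending on the input and suspect list, which is exactly the point: detection is a heuristic, not a guarantee.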
It's only tangentially related, but I recommend reading this, many times: 🐪🐫🐪🐫🐪: Why does modern Perl avoid UTF-8 by default? While there is always room for improvement in any endeavor, I suspect digging in and seeing how deep the problems actually run may sober your drive to add to the toolset. Go code diving in those modules, and add the Unicode::Tussle scripts to the pile if you get through the reading too quickly. :P