Re: How best to avoid mojibake, when attempting to automatically convert documents to utf-8?

and accurately detect their encoding/charset, and reliably convert them to utf-8.

You can sometimes do a good job, but you can't do it reliably. Guessing an encoding is a rescue/emergency tactic for when you are confronted with broken data. Different character sets overlap in the byte values they use, sometimes a lot, so the same bytes can be perfectly valid in several of them. A single byte of garbage can wreck detection of an otherwise obvious, valid guess. The modules you list are the way to go, but the two descriptions of this problem you've posted make it feel like an XY problem.
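To make the overlap and the garbage-byte point concrete, here is a minimal sketch. It is in Python only for illustration, since the behavior belongs to the bytes and not to the language. The helper `decodes_as` and the sample strings are made up for this example, and the candidate list is an arbitrary handful of encodings, not what any detection module actually tries.

```python
def decodes_as(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# 1. Overlap: one byte string, several equally "valid" readings.
data = "caf\xe9".encode("latin-1")  # b'caf\xe9', i.e. "cafe" with e-acute
candidates = ["utf-8", "latin-1", "cp1252", "koi8-r", "cp437"]
results = {enc: decodes_as(data, enc) for enc in candidates}
for enc, text in results.items():
    # !a prints non-ASCII characters as escapes, so the code points are visible
    print(f"{enc:8} -> {text!a}")

# 2. One garbage byte: well-formed UTF-8 stops being well-formed UTF-8 ...
good = "na\xefve caf\xe9".encode("utf-8")
bad = good + b"\xff"  # 0xFF never occurs in well-formed UTF-8
print(ascii(decodes_as(good, "utf-8")))
print(ascii(decodes_as(bad, "utf-8")))
# ... while latin-1 accepts every byte, so a naive fallback "succeeds"
# and hands back mojibake instead of an error.
print(ascii(decodes_as(bad, "latin-1")))
```

Only the UTF-8 attempt in part 1 fails. Every single-byte candidate decodes without complaint and each gives a different last character, so "it decoded" tells you very little. Part 2 is the reverse trap: the text is obviously UTF-8 to a human, but a strict check rejects it because of one stray byte.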

It's only tangentially related, but I recommend reading "🐪🐫🐪🐫🐪: Why does modern Perl avoid UTF-8 by default?" many times. While there is always room for improvement in any endeavor, I suspect that digging in and seeing how deep the problems actually run may sober your drive to add to the toolset. Go code diving in those modules, and add the Unicode::Tussle scripts to the pile if you are getting through the reading too quickly. :P

Re^2: How best to avoid mojibake, when attempting to automatically convert documents to utf-8? by taint (Chaplain) on Dec 21, 2013 at 01:44 UTC
OK. Looks like I responded too soon. So, in an effort to do your response justice, I'll try to give it a proper response this time. :) The article "unicode - Why does modern Perl avoid UTF-8 by default?" was extremely informative. A big help -- thanks! brian d foy's Unicode-Tussle utilities could quite possibly go a long way toward helping me in my current quest. Thanks again. I should be able to use some of them to at least determine what whatever I'm parsing/gulping/slurping/chomping claims its "code points" are. At the least, they'll likely help in creating the initial phases of testing, or maybe provide some bits I can include in a larger, more conclusive test. It's early, but it looks promising. Well, I've got more research to do. Thanks again for the great links, Your Mother!
--Chris
Yes. What say about me, is true.
Re^2: How best to avoid mojibake, when attempting to automatically convert documents to utf-8? by taint (Chaplain) on Dec 20, 2013 at 23:48 UTC
Thank you very much for the reply, Your Mother. Sounds discouraging. :( Seems like somebody should do it. Maybe a team effort? I dunno. I'm still attempting to work out all the details. "but the two descriptions of this problem you've posted make it feel like an XY problem." Any thoughts on a better title? I'm always open to suggestions. Thanks again for the reply, Your Mother. Looks like I still have a great deal of reading to do. :/
--Chris
UPDATE: "Why does modern Perl avoid UTF-8 by default?" was a great read. Thanks!
Yes. What say about me, is true.