in reply to What's the best way to detect character encodings, Windows-1252 v. UTF-8?
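You could use heuristic or statistical methods and base your guess about the encoding on how frequently particular bytes (the repertoire) occur. For example, in a French text you'll find lots of "é", "è", "ê", "à" and "ç", but something like "þ" will be extremely rare. So if decoding the data under one candidate encoding produces mostly the characters you expect, and decoding it under the other produces lots of unlikely ones, the first is probably the right one.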
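To make the frequency idea concrete, here is a rough sketch. It is written in Python only for illustration, and the function names (`score`, `guess_encoding`) and the two character sets are my own assumptions, not anything taken from an existing module. It decodes the bytes under each candidate encoding and prefers the result that contains more of the letters you'd expect in French and fewer of the ones you wouldn't.

```python
# Hypothetical character sets for French text; tune them for your own data.
EXPECTED = set("éèêàçùâîôûëïÉÈÊÀÇ")
# Characters that typically show up when the wrong decoder was used,
# plus U+FFFD, which marks bytes that could not be decoded at all.
SUSPICIOUS = set("þÃÂ¤\ufffd")


def score(text):
    """Count expected letters and subtract one for each suspicious one."""
    return sum((ch in EXPECTED) - (ch in SUSPICIOUS) for ch in text)


def guess_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the candidate whose decoding looks most like French text.

    On a tie (for example pure ASCII input) the first candidate wins.
    """
    best, best_score = None, None
    for enc in candidates:
        # errors="replace" turns undecodable bytes into U+FFFD
        # instead of raising, so they can be penalised by score().
        text = data.decode(enc, errors="replace")
        s = score(text)
        if best_score is None or s > best_score:
            best, best_score = enc, s
    return best


if __name__ == "__main__":
    sample = "été à la façon"
    print(guess_encoding(sample.encode("utf-8")))
    print(guess_encoding(sample.encode("cp1252")))
```

This is only a guess, of course: short inputs or text in another language can easily fool it, and the character sets above would need adjusting for anything other than French.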
There are probably also modules that can help, such as Encode::Guess, though I've never used it myself because I haven't needed it so far. Using one might still be better than trying to come up with something elaborate yourself. On the other hand, this particular module is focused on Far Eastern encodings (for Japanese and Chinese, among others), so it might not be the best fit for your purpose.
Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 15:40 UTC | |
by bart (Canon) on Jun 23, 2011 at 11:37 UTC |