There are byte sequences that are typical for UTF-8. The first byte of a multi-byte UTF-8 character must be in the range 0xC2-0xF4 (0xC2-0xDF for 2-byte, 0xE0-0xEF for 3-byte, and 0xF0-0xF4 for 4-byte sequences), and all the following bytes must be in the range 0x80-0xBF. So if you see a byte above 0x7F (an accented character, say) that is not part of such a sequence, you know the text is not UTF-8. You might guess it's probably ISO-Latin-1 (= ISO-8859-1) or Microsoft's extension of it, the Windows character set AKA CP-1252; but that's not necessarily the case. It could be DOS text, for example... or ISO-8859-15.
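To make the byte-range rule concrete, here is a minimal sketch in Python (chosen only for brevity; the logic ports directly to Perl). The function name `looks_like_utf8` is made up for this illustration, and it uses the strict lead-byte ranges of current UTF-8 (0xC2-0xF4). It is a structural check only: it does not reject overlong forms or surrogates, which a real decoder would.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every non-ASCII byte sits in a well-formed UTF-8 sequence.

    Hypothetical helper for illustration. Structural check only: a strict
    decoder (e.g. data.decode("utf-8")) also rejects overlong encodings
    and surrogate code points, which this sketch lets through.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # plain ASCII, one byte
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:        # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:      # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF4:      # lead byte of a 4-byte sequence
            need = 3
        else:
            return False             # stray continuation byte or invalid lead byte
        tail = data[i + 1 : i + 1 + need]
        # The sequence must be complete and consist only of continuation bytes.
        if len(tail) != need or any(not 0x80 <= t <= 0xBF for t in tail):
            return False
        i += 1 + need
    return True
```

A Latin-1 "é" is the single byte 0xE9, which looks like the start of a 3-byte sequence; since the bytes after it are not continuation bytes, the check fails and you know the text is not UTF-8.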
You could use heuristic/statistical methods and base a guess about the encoding on how often particular bytes occur (the repertoire). For example, in a French text you'll find lots of "é", "è", "ê", "à" and "ç", but something like "þ" will be extremely rare.
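A rough sketch of that idea, again in Python and under loud assumptions: the text is French, the candidate list and the `EXPECTED` letter set are my own picks for illustration, and the scoring (plus one for an expected letter, minus one for any other non-ASCII character) is about the crudest thing that could work. It is not a substitute for a tested module.

```python
# Letters we expect to see often in French text (an assumption for this sketch).
EXPECTED = set("éèêàçùâîôûëïÉÈÀÇœŒ€")

def guess_encoding(data: bytes,
                   candidates=("utf-8", "cp1252", "iso8859-15", "cp850")):
    """Return the candidate whose decoding yields the most plausible repertoire.

    Hypothetical helper: each non-ASCII character scores +1 if it is in
    EXPECTED and -1 otherwise. On a tie, the earlier candidate wins.
    Returns None if no candidate can decode the data at all.
    """
    best_name = None
    best_score = float("-inf")
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue                     # bytes are impossible in this encoding
        score = 0
        for ch in text:
            if ord(ch) > 127:
                score += 1 if ch in EXPECTED else -1
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Note that ISO-8859-1, CP-1252 and ISO-8859-15 put the common French letters at the same byte values, so they tie and the first one listed wins; this kind of scoring mainly separates that family from DOS code pages and from UTF-8.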
There are probably also modules to help you, like Encode::Guess, though I've never used it myself since I haven't needed it so far. It might be better than trying to come up with something elaborate yourself. On the other hand, this particular module is focused on Far Eastern encodings (for Japanese and Chinese, among others), so it might not be the best fit for your purpose.