Well then here's how I'd do it. I'd check the whole file for UTF-8 sequences and any other bytes with value 128 or above.
- If you find no bytes with value 128-255, then the file is ASCII (or CP-1252 or UTF-8, they're all the same here.)
- If you only find valid UTF-8 byte sequences then it's probably UTF-8. (If the first sequence is at the start of the file and it's a BOM character, value 0xFEFF, then there is very little doubt about it)
- If you only find other upper half bytes then it's CP-1252.
- If you find both, it's more likely that it's CP-1252, but you'd better take a look at it; It could be a corrupt UTF-8 file.
Code to test this, assuming $_ contains the whole file, and is not converted to utf-8:
my(%utf8, %single);
while(/([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]|[\xF0
+-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\x80-\xFF])/g) {
if($1) {
$utf8{$1}++;
} elsif($2) {
$single{$1}++;
}
}
(untested)
If after this code block %single is empty and %utf8 is not empty, then it's UTF-8; if %single is not empty then it's CP-1252 with high certainty if %utf8 is empty.
<You can do simpler tests than this one, that don't involve hashes, but this way it's easier to debug and verify why it decided one way, and not another way.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.