Re: Guess between UTF8 and Latin1/ISO-8859-1

Sure. Working byte-wise, every UTF-8 character with a character code >= 128 is encoded as a lead byte followed by one or more continuation bytes, so it must match the following pattern:

  /[\xC0-\xFF][\x80-\xBF]+/

(Actually you can even put more stringent constraints on the byte sequence, but this will do for a start.)
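As a rough sketch of what those more stringent constraints could look like: the pattern below follows the well-formed byte sequences from RFC 3629, so it rejects overlong forms such as \xC0\x80, encoded surrogates, and lead bytes above \xF4, all of which the simple pattern lets through. The names $utf8_multibyte and is_strict_utf8 are made up for this example, and the input is assumed to be a raw byte string.

```perl
use strict;
use warnings;

# One well-formed multi-byte UTF-8 sequence (byte ranges per RFC 3629).
# Hypothetical name, not part of the original post.
my $utf8_multibyte = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Returns 1 if the whole byte string is ASCII plus well-formed UTF-8
# sequences, 0 otherwise. Assumes $bytes holds raw (undecoded) bytes.
sub is_strict_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F]|$utf8_multibyte)*\z/ ? 1 : 0;
}
```

For example, is_strict_utf8("caf\xC3\xA9") is true, while is_strict_utf8("caf\xE9") and is_strict_utf8("\xC0\x80") are false.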

It means that if you encounter anything matching /[\x80-\xFF]/ outside what's matched by the above pattern, the data is not (valid) UTF-8. You can check for that, for example, like this:

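my ($utf8, $bare) = (0, 0);
use bytes;    # match against raw bytes, not characters
# Every match starts at a high byte: either a lead byte plus its
# continuation bytes (first alternative, $1 undefined), or a single
# stray byte captured in $1 (second alternative).
while (/(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
    if (defined $1) { $bare++ } else { $utf8++ }
}
print <<"END";
utf-8: $utf8
bare: $bare
END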

The idea behind the pattern is that the properly formed UTF-8 characters are eaten using the first alternative, and the remaining bytes by the second.

If $bare ends up with a value > 0, then the data is not UTF-8. If the string doesn't contain any bytes with character code >= 128, then it doesn't matter which encoding you choose, because UTF-8 and Latin1 both agree with ASCII in that range. Both $bare and $utf8 will be zero in that case.
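To make that decision rule concrete, here is a minimal sketch that wraps the counting loop in a function. The name guess_encoding and the three return labels are invented for this example, and the result is only a heuristic: Latin1 text that happens to contain a sequence like \xC3\xA9 would still be reported as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a raw byte string as 'ascii', 'utf-8'
# or 'latin1', using the same loose pattern as the counting loop.
sub guess_encoding {
    my ($bytes) = @_;
    my ($utf8, $bare) = (0, 0);
    while ($bytes =~ /(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
        if (defined $1) { $bare++ } else { $utf8++ }
    }
    return 'latin1' if $bare;    # at least one stray high byte
    return 'utf-8'  if $utf8;    # only well-shaped multi-byte sequences
    return 'ascii';              # no high bytes at all: either choice works
}

print guess_encoding("caf\xC3\xA9"), "\n";
print guess_encoding("caf\xE9"), "\n";
```

The two calls above should report utf-8 and latin1 respectively; "latin1" here really means "not UTF-8", since the test cannot tell Latin1 apart from other single-byte encodings.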


Re: Re: Guess between UTF8 and Latin1/ISO-8859-1 by CountZero (Bishop) on Jan 21, 2004 at 21:09 UTC
<off_topic>If it is that easy, how come my MS Internet Explorer miserably fails to automatically recognize the fact that some files are Unicode and I get all kinds of weird characters on my screen?</off_topic> CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Re: Re: Guess between UTF8 and Latin1/ISO-8859-1 by iburrell (Chaplain) on Jan 21, 2004 at 23:26 UTC
Probably because Microsoft drew the line at the insanity of examining the whole file to determine the character set, instead of just checking the content type. Not to mention the difficulty of trying to figure out the encoding automatically. There is a big difference between "this is invalid UTF-8 so it must be Latin1" and "this weird stuff must be EUC-KR". Also, saying a file is Unicode does not specify the encoding. There are multiple encodings for Unicode, and most non-Unicode encodings can be mapped to Unicode, as long as they are declared.
Re: Re: Re: Guess between UTF8 and Latin1/ISO-8859-1 by allolex (Curate) on Jan 21, 2004 at 21:29 UTC
I don't think they're using Perl on IE. That pretty much explains everything. ;) Actually, those pages would work right if people bothered declaring which encoding they're using. So many standards... so little compliance. -- Allolex