A multi-byte UTF-8 character always matches this pattern:

/[\xC0-\xFF][\x80-\xBF]+/

(Actually you can put even more stringent constraints on the byte sequence, but this will do for a start.)
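For the curious, here is a rough sketch of what those more stringent constraints could look like. The byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences (no overlong forms, no surrogates, nothing above U+10FFFF). The names `$strict_utf8_char` and `is_strict_utf8` are made up for this illustration.

```perl
use strict;
use warnings;

# Stricter pattern for ONE well-formed multi-byte UTF-8 character.
my $strict_utf8_char = qr/
      [\xC2-\xDF][\x80-\xBF]                # 2 bytes
    | \xE0[\xA0-\xBF][\x80-\xBF]            # 3 bytes, no overlong forms
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}     # 3 bytes
    | \xED[\x80-\x9F][\x80-\xBF]            # 3 bytes, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}         # 4 bytes, no overlong forms
    | [\xF1-\xF3][\x80-\xBF]{3}             # 4 bytes
    | \xF4[\x80-\x8F][\x80-\xBF]{2}         # 4 bytes, capped at U+10FFFF
/x;

# Hypothetical helper: returns 1 if the byte string consists only of
# ASCII bytes and well-formed multi-byte sequences, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F]|$strict_utf8_char)*\z/ ? 1 : 0;
}

print is_strict_utf8("caf\xC3\xA9"), "\n";   # prints 1 (valid UTF-8)
print is_strict_utf8("caf\xE9"),     "\n";   # prints 0 (bare Latin-1 byte)
print is_strict_utf8("\xC0\x80"),    "\n";   # prints 0 (overlong encoding)
```

The looser pattern accepts a few sequences this one rejects, such as `\xC0\x80`, but it is enough for guessing between UTF-8 and Latin-1.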
This means that if you encounter a byte matching /[\x80-\xFF]/ anywhere outside a match of /[\xC0-\xFF][\x80-\xBF]+/, the data is not (valid) UTF-8. You can check this, for example, like so:
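my ($utf8, $bare) = (0, 0);
use bytes;
# Scan $_ for high bytes; $1 is set only when a lone byte was matched.
while (/(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
    $bare++ if defined $1;      # stray byte outside any UTF-8 sequence
    $utf8++ unless defined $1;  # multi-byte UTF-8 sequence
}
print <<"END";
utf-8: $utf8
bare: $bare
END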
The idea behind the pattern is that the properly formed UTF-8 characters are eaten using the first alternative, and the remaining bytes by the second.
If $bare ends up with a value > 0, then it's not UTF-8. If the string doesn't contain any bytes with character code >= 128, then it doesn't matter which encoding you choose: both $bare and $utf8 will be zero in that case.
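If you want this as a reusable check, the counting loop above can be wrapped in a small sub. This is only a sketch: the name `guess_encoding` and its return values are invented here, and it assumes the only candidates are UTF-8 and Latin-1, as in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a byte string with the loose pattern.
sub guess_encoding {
    my ($bytes) = @_;
    my ($utf8, $bare) = (0, 0);
    while ($bytes =~ /(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
        # $1 is defined only when the second alternative matched.
        if (defined $1) { $bare++ } else { $utf8++ }
    }
    return 'latin-1' if $bare;   # stray high byte: not UTF-8 (assumed Latin-1)
    return 'utf-8'   if $utf8;   # high bytes only inside UTF-8 sequences
    return 'ascii';              # no high bytes: either encoding works
}

print guess_encoding("caf\xC3\xA9"), "\n";   # prints utf-8
print guess_encoding("caf\xE9"),     "\n";   # prints latin-1
print guess_encoding("cafe"),        "\n";   # prints ascii
```

Taking the string as an argument instead of relying on $_ makes the sub easier to call from other code. The matching logic is unchanged.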
In reply to Re: Guess between UTF8 and Latin1/ISO-8859-1 by bart, in thread Guess between UTF8 and Latin1/ISO-8859-1 by Jenda