in reply to Guess between UTF8 and Latin1/ISO-8859-1
In UTF-8, every character beyond ASCII is encoded as a byte sequence matching /[\xC0-\xFF][\x80-\xBF]+/. (Actually you can even put more stringent constraints on the byte sequence, but this will do for a start.)
This means that if you encounter anything matching /[\x80-\xFF]/ outside what's matched by the above pattern, it's not (valid) UTF-8. You can check for that with, for example:
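
    my ($utf8, $bare) = (0, 0);
    use bytes;
    # Matches against $_, visiting every byte >= 0x80 exactly once.
    while (/(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
        $bare++ if defined $1;       # second alternative: stray high byte
        $utf8++ unless defined $1;   # first alternative: UTF-8-shaped sequence
    }
    print "utf-8: $utf8\nbare: $bare\n";
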
The idea behind the pattern is that the properly formed UTF-8 characters are eaten using the first alternative, and the remaining bytes by the second.
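To make the two alternatives concrete, here is a minimal sketch that wraps the same loop in a subroutine and runs it on three sample byte strings. The name count_high_bytes and the sample strings are hypothetical, chosen only for illustration, and the sketch assumes the input is a raw byte string rather than an already decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the pair
# (utf8_count, bare_count) for a raw byte string.
sub count_high_bytes {
    my ($bytes) = @_;
    my ($utf8, $bare) = (0, 0);
    while ($bytes =~ /(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
        if (defined $1) { $bare++ }   # second alternative: stray high byte
        else            { $utf8++ }   # first alternative: lead + continuation bytes
    }
    return ($utf8, $bare);
}

# Sample inputs: the same word in ASCII, UTF-8 and Latin-1.
my @samples = (
    [ 'ASCII "cafe"',        "cafe"         ],
    [ 'UTF-8 "caf\xC3\xA9"', "caf\xC3\xA9"  ],
    [ 'Latin-1 "caf\xE9"',   "caf\xE9"      ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    my ($utf8, $bare) = count_high_bytes($bytes);
    print "$label: utf-8: $utf8 bare: $bare\n";
}
# ASCII:   utf-8: 0 bare: 0
# UTF-8:   utf-8: 1 bare: 0
# Latin-1: utf-8: 0 bare: 1
```

In the Latin-1 sample the byte \xE9 looks like a lead byte, but no continuation byte follows it, so the first alternative fails and the byte falls through to the second.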
If $bare ends up with a value > 0, then it's not UTF-8. If the string doesn't contain any bytes with character code >= 128, then it doesn't matter which you choose. Both $bare and $utf8 will be zero, in that case.
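The converse is weaker: $bare == 0 only shows that the bytes are UTF-8-shaped, because the simple pattern also accepts sequences such as "\xC0\x80" that are not well-formed UTF-8. As a sketch of the "more stringent constraints" mentioned earlier, the pattern below follows the well-formed byte sequence ranges given in the Unicode standard and RFC 3629; the names $strict_char and looks_like_strict_utf8 are hypothetical, and the input is again assumed to be a raw byte string.

```perl
use strict;
use warnings;

# One well-formed multi-byte UTF-8 sequence. This rejects overlong
# forms, UTF-16 surrogates and code points above U+10FFFF.
my $strict_char = qr/
      [\xC2-\xDF][\x80-\xBF]               # 2 bytes: U+0080..U+07FF
    | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, no overlong forms
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, general case
    | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, no overlong forms
    | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, general case
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, up to U+10FFFF
/x;

# Hypothetical helper: returns 1 if the whole byte string is ASCII
# plus well-formed UTF-8 sequences, 0 otherwise.
sub looks_like_strict_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F]|$strict_char)*\z/ ? 1 : 0;
}

print looks_like_strict_utf8("caf\xC3\xA9"), "\n";   # 1: well-formed UTF-8
print looks_like_strict_utf8("caf\xE9"),     "\n";   # 0: bare Latin-1 byte
print looks_like_strict_utf8("\xC0\x80"),    "\n";   # 0: overlong encoding
```

A pure-ASCII string passes this check as well, which matches the remark above that the choice does not matter in that case.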