Re: Perl detect utf8, iso-8859-1 encoding

Fundamentally, you cannot reliably detect encodings. You can guess UTF-8 if the input is valid UTF-8, but that is still a guess at best.

The problem is that the pre-Unicode 8-bit encodings made full use of the 256 values an octet can hold. UTF-8 has to work with those same 256 byte values (and the lower 128 are ASCII), so all valid UTF-8 is also valid input for those encodings (ISO-8859-1, for one, accepts any octet sequence). There is no general solution to this problem, although you might be able to make some headway with either a dictionary of valid names, or some rules for recognizing "plausible" names, that is, names that use only characters used in names from one language, since mixed-language names are highly unlikely.
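To make the ambiguity concrete, here is a small sketch (the sample octets and variable names are my own, not from the original question) that decodes the same two octets once as UTF-8 and once as ISO-8859-1. Both decodes succeed; they just disagree about what the text is.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+00E9 (e with acute accent).
my $octets = "\xC3\xA9";

# Strict decode as UTF-8: succeeds and yields one character.
my $as_utf8 = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Strict decode as ISO-8859-1: also succeeds, because every octet value is a
# defined Latin-1 character. The result is two characters.
my $as_latin1 = decode('ISO-8859-1', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

for my $pair (['UTF-8', $as_utf8], ['Latin-1', $as_latin1]) {
    my ($label, $chars) = @$pair;
    my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $chars;
    printf "%-8s %d character(s): %s\n", $label, scalar(@codepoints), "@codepoints";
}
# prints:
# UTF-8    1 character(s): U+00E9
# Latin-1  2 character(s): U+00C3 U+00A9
```

Nothing in the octets themselves says which reading was intended, which is why any detection is a guess.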

For the special case of deciding whether the input is UTF-8 as requested or ISO-Latin-1 due to following an outdated link, you can probably make good progress by simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not. This is not exactly correct, but is probably a fair heuristic.
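A minimal sketch of that heuristic, assuming the Encode module; the helper name `decode_utf8_or_latin1` and the sample strings are hypothetical, made up for illustration. It tries a strict UTF-8 decode first and falls back to ISO-8859-1 only when that fails.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (decoded characters, name of the guessed encoding).
sub decode_utf8_or_latin1 {
    my ($octets) = @_;

    # Strict UTF-8 decode. FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $octets untouched so the fallback can still use it.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($chars, 'UTF-8') if defined $chars;

    # Not valid UTF-8: assume ISO-8859-1, which accepts any octet sequence.
    return (decode('ISO-8859-1', $octets), 'ISO-8859-1');
}

my ($name1, $enc1) = decode_utf8_or_latin1("Ren\xC3\xA9");  # UTF-8 octets
my ($name2, $enc2) = decode_utf8_or_latin1("Ren\xE9");      # Latin-1 octets

print "$enc1 $enc2\n";                                    # UTF-8 ISO-8859-1
print "same text\n" if $name1 eq $name2;                  # same text
```

Pure ASCII input is reported as UTF-8 here, which is harmless since both encodings decode it identically.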

Re^2: Perl detect utf8, iso-8859-1 encoding by swiftlet (Acolyte) on Jul 25, 2020 at 00:50 UTC
"simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not"

Thanks! This is a good idea, but how can I find out whether the input is valid UTF-8 or not? Neither utf8::valid nor utf8::is_utf8 works well in my examples.
Re^3: Perl detect utf8, iso-8859-1 encoding by haj (Vicar) on Jul 25, 2020 at 08:50 UTC
Checking whether data are valid UTF-8 is rather straightforward. Here's the example, slightly modified from the synopsis of Encode: `use Encode qw(decode encode); $characters = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);` This code will `die` if the data are invalid, so you would wrap it in the exception handler of your choice; plain `eval` and Try::Tiny seem to be popular.

BTW: as jcb already indicated, chances are excellent that if data pass as UTF-8, they actually are UTF-8. All bytes of multibyte characters in valid UTF-8 strings are in the range `\x80` to `\xFF`, and in particular bytes 2-4 are in the range `\x80-\xBF`. You just can't build readable text from characters in that range in any of the ISO-8859-* encodings, and half of that range (`\x80-\x9F`) are "unprintable" control characters from ISO/IEC 6429.
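For the plain `eval` variant, a sketch might look like the following. The helper name `is_valid_utf8` and the test strings are my own invention; only `decode` and the two check flags come from Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $octets is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($octets) = @_;

    # eval traps the die raised by FB_CROAK. The trailing 1 makes success
    # explicit, so an empty (false) decode result is not mistaken for failure.
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9"), "\n";   # 1: \xC3\xA9 is a complete two-byte sequence
print is_valid_utf8("caf\xE9"),     "\n";   # 0: \xE9 starts a sequence that never finishes
print is_valid_utf8("plain ASCII"), "\n";   # 1: ASCII is always valid UTF-8
```

If you need to know why the data were rejected, the message from Encode is available in `$@` right after the `eval`.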


	PerlMonks