PerlMonks  

Re: Perl detect utf8, iso-8859-1 encoding

by jcb (Parson)
on Jul 25, 2020 at 00:02 UTC ( [id://11119780]=note )


in reply to Perl detect utf8, iso-8859-1 encoding

Fundamentally, you cannot reliably detect encodings. You can guess UTF-8 if the input is valid UTF-8, but that is still a guess at best.

The problem is that pre-Unicode single-byte encodings made full use of the 256 values available in an octet. UTF-8 must use those same 256 octet values (and the lower 128 are ASCII), so every valid UTF-8 byte sequence is also a valid byte sequence in a single-byte encoding such as ISO-8859-1. There is no general solution to this problem, although you might be able to make some headway with either a dictionary of valid names, or some rules for recognizing "plausible" names, that is, names that use only characters used in names from one language, since mixed-language names are highly unlikely.

For the special case of deciding whether the input is UTF-8 as requested or ISO-Latin-1 due to following an outdated link, you can probably make good progress by simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not. This is not exactly correct, but is probably a fair heuristic.

Replies are listed 'Best First'.
Re^2: Perl detect utf8, iso-8859-1 encoding
by swiftlet (Acolyte) on Jul 25, 2020 at 00:50 UTC

    simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not

    Thanks! This is a good idea, but how can I find out whether the input is valid UTF-8 or not? Neither utf8::valid nor utf8::is_utf8 works well in my examples.

      Checking whether data are valid UTF-8 is rather straightforward. Here's the example, slightly modified from the synopsis of Encode:

          use Encode qw(decode encode);
          $characters = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

      This code will die if the data are invalid, so you would wrap it in the exception handler of your choice; plain eval and Try::Tiny seem to be popular.
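
A minimal sketch of that wrapping with plain eval, combined with the "assume ISO-Latin-1 if not valid UTF-8" heuristic from the parent post. The sub name decode_utf8_or_latin1 and its two-value return are my own invention for illustration, not something from Encode, and the fallback is only a guess for the reasons given above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns the decoded
# character string plus the name of the encoding that was assumed.
sub decode_utf8_or_latin1 {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $octets unmodified so it can still be used for the fallback.
    my $characters = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($characters, 'UTF-8') if defined $characters;

    # Heuristic fallback: any octet sequence decodes as ISO-8859-1,
    # so this step cannot fail, but it may still be the wrong guess.
    return (decode('ISO-8859-1', $octets), 'ISO-8859-1');
}

# The same text ("cafe au lait" with e-acute) in both encodings.
for my $octets ("caf\xC3\xA9 au lait", "caf\xE9 au lait") {
    my ($text, $encoding) = decode_utf8_or_latin1($octets);
    print "assumed $encoding\n";
}
```

Both inputs end up as the same character string; only the reported encoding differs.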

      BTW: as jcb already indicated, chances are excellent that if data pass as UTF-8, they actually are UTF-8. All bytes of multibyte characters in valid UTF-8 strings are in the range \x80 to \xFF, and in particular the second through fourth bytes are in the range \x80-\xBF. You just can't build readable text from characters in that range in any of the ISO-8859-* encodings, and about half of that range (\x80-\x9F) consists of "unprintable" control characters from ISO/IEC 6429.
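
To illustrate the byte-range argument, here is a small sketch that feeds two ISO-8859-1 byte strings to the UTF-8 check. The helper name is_valid_utf8 and the sample strings are assumptions made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if $octets decode cleanly as UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Ordinary Latin-1 text ("resume" with two e-acutes). \xE9 would start
# a three-byte UTF-8 sequence, but the following "s" is not a
# continuation byte (\x80-\xBF), so validation fails.
my $latin1 = encode('ISO-8859-1', "r\x{E9}sum\x{E9}");

# Latin-1 text that happens to look like UTF-8: A-tilde followed by the
# copyright sign is \xC3 \xA9, which is also the UTF-8 form of e-acute.
# It validates, but such character pairs are rare in real Latin-1 text.
my $lookalike = encode('ISO-8859-1', "\x{C3}\x{A9}");

printf "%-10s %s\n", 'latin1',    is_valid_utf8($latin1)    ? 'valid' : 'invalid';
printf "%-10s %s\n", 'lookalike', is_valid_utf8($lookalike) ? 'valid' : 'invalid';
```

The first string is rejected and the second is accepted, which is the residual ambiguity the heuristic has to live with.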
