Re: utf8 char or binary string detection

Disclaimer: my answer is based on how I understood your question. If I got it wrong, please try to rephrase the question to make it easier to understand :)

First of all, some clarification. When your program obtains something from the system, it is always just a sequence of bytes. After that, the program may decide to treat this sequence as something more than just bytes. When your program gives something back to the system, it must again be just a sequence of bytes; it makes no difference how the program was viewing the data before.

Text (including file names) may contain multi-byte characters, which follow certain rules (an encoding). Still, to the system the text is just a sequence of bytes. When a Perl program receives this sequence, it may decide to view it as a string of characters. The function Encode::decode is provided for that. In fact, to add more convenience and confusion, this functionality can be attached to an input stream, but at the base level one just converts bytes to characters using Encode::decode. At that point the piece of data receives the "is_utf8" marker. This does not mean that the text really is in UTF-8; it just means that perl works with it as characters.
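To make the bytes-versus-characters distinction concrete, here is a minimal sketch. The sample bytes (the UTF-8 encoding of the Russian word "мир") and the variable names are my own illustration, not something from your question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Six raw bytes: the UTF-8 encoding of three Cyrillic letters (sample data).
my $bytes = "\xD0\xBC\xD0\xB8\xD1\x80";

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps decode()
# from modifying $bytes while it checks.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "bytes: length %d, flag %s\n",
    length $bytes, utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, flag %s\n",
    length $chars, utf8::is_utf8($chars) ? 'on' : 'off';
```

The same data has length 6 before decoding and length 3 after it, which is usually the easiest way to tell which view perl currently has.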

When you want to give that data back to the system, for example when printing to the screen or writing to a file, you must convert it back to bytes using Encode::encode. This strips the "is_utf8" flag from the data. Again, to add confusion, this conversion may be attached to the output stream.
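A small sketch of both routes, explicit Encode::encode and a conversion attached to the handle. I use an in-memory file handle here only so that the result can be inspected; the sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{43C}\x{438}\x{440}";   # three characters (sample data)

# Explicit conversion: characters in, bytes out.
my $bytes = encode('UTF-8', $chars);
printf "%d characters became %d bytes, flag %s\n",
    length $chars, length $bytes, utf8::is_utf8($bytes) ? 'on' : 'off';

# The same conversion attached to an output handle as an I/O layer.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$out} $chars;
close $out;

print "the layer produced the same bytes\n" if $buffer eq $bytes;
```

Either way the system ends up with the same six bytes; the layer just saves you from calling encode before every print.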

As a "side effect", the two functions together may convert text from one character encoding to another, but that can create problems if the input does not contain text in the expected character encoding.

The function utf8::is_utf8 just reports whether perl sees the piece of data as a "string of characters" rather than a "string of bytes". Printing such data normally produces a warning ("Wide character in print", raised when the string contains characters above 255), since for output one must supply only bytes. Again, you can set up the output stream to perform the conversion automatically and avoid the warning.
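The warning and its cure can be observed with a sketch like the following. It captures warnings in an array instead of letting them reach the terminal, and it writes to an in-memory handle; both choices, like the variable names, are just for illustration.

```perl
use strict;
use warnings;

# Collect warnings so they can be inspected afterwards.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my $chars = "\x{43C}\x{438}\x{440}";   # characters above 255 (sample data)

my $buffer = '';
open my $out, '>', \$buffer or die "open: $!";
print {$out} $chars;                 # no layer on the handle: perl warns
binmode $out, ':encoding(UTF-8)';    # attach the conversion to the handle
print {$out} $chars;                 # same data, no warning this time
close $out;

print scalar(@warnings), " warning(s): @warnings";
```

Only the first print should add an entry to @warnings.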

Now, to the problem with file names and UTF-8. Quite often a "double conversion" happens. A program gives the system a string containing, for example, bytes that represent Russian characters in UTF-8 encoding. The file system receives this string, but it has an option set which says that all input is in Latin1 encoding and must be converted to UTF-8. So the file system converts the data one more time, and as a result the user sees junk, even though this junk is valid UTF-8. That is why, when mounting external disks, I usually pass the "utf8" option to the mount command.
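The double conversion is easy to reproduce inside perl itself. This sketch only imitates what such a file system option would do; the sample bytes and variable names are made up for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Correct UTF-8 bytes for three Cyrillic letters (sample data).
my $good = "\xD0\xBC\xD0\xB8\xD1\x80";

# The mistake: treat every byte as one Latin1 character ...
my $misread = decode('ISO-8859-1', $good);
# ... and then convert those six "characters" to UTF-8 once more.
my $junk = encode('UTF-8', $misread);

printf "good: %d bytes, junk: %d bytes\n", length $good, length $junk;

# The junk is still well-formed UTF-8, so a validity check cannot catch it.
my $still_valid = eval {
    decode('UTF-8', $junk, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print "junk decodes cleanly as UTF-8\n" if $still_valid;
```

Every byte of the original is above 127 here, so each one turns into two bytes and the junk is twice as long as the good data.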

Obviously, if your program gets junk encoded as UTF-8, there is no way for it to fix things unless you know how the junk was created in the first place. For example, in the case above, where legal UTF-8 was treated as Latin1 and converted to UTF-8 one more time, one can try the reverse conversion from UTF-8 to Latin1. Something like Encode::from_to($bytes, 'UTF-8', 'Latin1'), which converts $bytes in place. Again, this only works if you know why you got the junk.
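Here is a hedged sketch of that repair. The junk bytes are a made-up sample of double-encoded Cyrillic text; passing FB_CROAK as the fourth argument is my addition, so that from_to dies instead of substituting characters when the data does not fit Latin1, which would suggest the junk came from some other mishap.

```perl
use strict;
use warnings;
use Encode qw(decode from_to);

# Double-encoded sample: UTF-8 that was read as Latin1 and encoded again.
my $bytes = "\xC3\x90\xC2\xBC\xC3\x90\xC2\xB8\xC3\x91\xC2\x80";

# from_to() converts $bytes in place. The eval catches the exception that
# FB_CROAK raises when a character has no Latin1 equivalent.
my $repaired = eval { from_to($bytes, 'UTF-8', 'Latin1', Encode::FB_CROAK); 1 };

if ($repaired) {
    # $bytes should now hold ordinary UTF-8; decode it to count characters.
    my $chars = decode('UTF-8', $bytes);
    printf "repaired: %d bytes, %d characters\n", length $bytes, length $chars;
}
else {
    print "not a UTF-8/Latin1 double encoding: $@";
}
```

Run it on a copy of the data, since from_to overwrites its first argument.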

In general, to avoid problems, one should just follow a simple rule: "when communicating with the system, get and give only bytes (octets)". To achieve this, one can use either the Encode module or various pragmas. When working with modules, one has to learn carefully what those modules expect and produce. If it is not documented, then one has to experiment. Here "utf8::is_utf8" or "Encode::is_utf8" can be used to check whether multi-byte data is being treated as a sequence of bytes or as a sequence of characters.


