http://qs1969.pair.com?node_id=1174574


in reply to How to concatenate utf8 safely?

Normally you should never have to worry about joining strings together in Perl. A simple "a" . "b" should just work. If you have a problem with that, then most likely you don't understand how things work. Try reading perldoc Encode carefully.

Just in case, here is a simplified description. Applications exchange data as bytes, or octets. "Octets" are not the same as the "characters" that humans read: one character can be represented by several octets. If your program does not care about characters (it does not change their case, does not split on them, etc.), then it may simply take data in and give data out without worrying about UTF-8, Unicode or whatever. But usually one has to manipulate characters, and that's where the confusion starts.
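To make the octet/character difference concrete, here is a minimal sketch. The sample string is my own choice, not something from the original question: the two octets 0xC3 0xA9 are how UTF-8 represents the single character "é" (U+00E9).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets as they might arrive from outside: UTF-8 for U+00E9.
my $octets = "\xC3\xA9";

# After decoding, perl sees one character instead of two octets.
my $chars = decode('UTF-8', $octets);

my $octet_count = length $octets;   # length counts octets here
my $char_count  = length $chars;    # length counts characters here

printf "%d octets, %d character\n", $octet_count, $char_count;
# prints "2 octets, 1 character"
```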

First of all, you have to worry about how characters are represented in the octets that you receive from external applications. That depends on locale settings, but most modern Unix systems provide characters encoded as UTF-8. After you receive data from outside, you have to tell perl the encoding of that data, so that perl can split it into characters. This is done either by calling Encode::decode directly, or by setting up the input stream so that it does the decoding for you (with binmode, for example). After this, perl is ready to view your data as characters instead of octets.
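Both ways of decoding input can be sketched like this. I assume the incoming data is UTF-8, and I read from an in-memory filehandle only to keep the example self-contained; for a real stream you would use something like binmode(STDIN, ':encoding(UTF-8)') instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Pretend this arrived from outside: "naive" with a diaeresis, as UTF-8 octets.
my $octets = "na\xC3\xAFve\n";

# Option 1: decode explicitly.
my $decoded = decode('UTF-8', $octets);

# Option 2: let an I/O layer decode while reading.
# (In-memory handle used here in place of a real file or STDIN.)
open my $in, '<:encoding(UTF-8)', \$octets or die "open: $!";
my $line = <$in>;
close $in;

# Either way we end up with the same character string.
printf "octets: %d, characters: %d, same: %s\n",
    length $octets, length $decoded, ($decoded eq $line ? 'yes' : 'no');
# prints "octets: 7, characters: 6, same: yes"
```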

Of course you also have to worry about the strings that you type directly into your perl code. Perl has to know their encoding as well. If your editor saves files as UTF-8 by default, then you can put "use utf8;" into the code so that perl treats the source text as UTF-8, which has the same effect as decoding all your quoted strings and patterns. Or again, without "use utf8;", you can call Encode::decode on them yourself.
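A small sketch of the difference, assuming this file itself is saved as UTF-8 (if your editor uses another encoding, the numbers will differ). The variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my ($with_pragma, $without_pragma, $decoded_by_hand);

{
    use utf8;                     # literals in this block are read as characters
    $with_pragma = "é";
}
{
    no utf8;                      # literals in this block stay raw octets
    $without_pragma  = "é";
    $decoded_by_hand = decode('UTF-8', $without_pragma);
}

printf "%d %d %d\n",
    length $with_pragma, length $without_pragma, length $decoded_by_hand;
# prints "1 2 1" when the file is saved as UTF-8
```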

The two steps above ensure that perl knows how to split your strings into characters. But when you want to output your character strings to the outside world, you have to do the reverse conversion, from a "character string" to an "octet string". Again, you can either call Encode::encode directly, or configure your output stream so that it does the encoding for you automatically.
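The output side looks much the same in reverse. Again this is only a sketch: I assume the outside world wants UTF-8, and I write to an in-memory buffer so the result can be inspected; for real output you would use something like binmode(STDOUT, ':encoding(UTF-8)').

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: 4 characters, the last one is U+00E9.
my $chars = "caf\x{E9}";

# Option 1: encode explicitly.
my $octets = encode('UTF-8', $chars);

# Option 2: let an I/O layer encode while writing.
# (In-memory buffer used here in place of a real file or STDOUT.)
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$out} $chars;
close $out;

printf "characters: %d, octets: %d, same: %s\n",
    length $chars, length $octets, ($octets eq $buffer ? 'yes' : 'no');
# prints "characters: 4, octets: 5, same: yes"
```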

If all these steps are handled correctly, then you never have to worry about string concatenation.
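For illustration, here is one way concatenation can appear to go wrong when a step is skipped. I am only guessing that this is the kind of problem behind the question: undecoded octets are joined with a real character string, and the result comes out double-encoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC3\xA9";     # "é" as UTF-8 octets, not yet decoded
my $chars  = "\x{263A}";     # a proper character string (U+263A)

# Wrong: the two octets are treated as two separate characters.
my $mixed = $octets . $chars;

# Right: decode first, then concatenate.
my $joined = decode('UTF-8', $octets) . $chars;

my $mixed_out  = encode('UTF-8', $mixed);    # double-encoded mess
my $joined_out = encode('UTF-8', $joined);   # what you actually wanted

printf "mixed: %d chars, %d octets out\n",  length $mixed,  length $mixed_out;
printf "joined: %d chars, %d octets out\n", length $joined, length $joined_out;
# prints "mixed: 3 chars, 7 octets out"
# prints "joined: 2 chars, 5 octets out"
```

Note that the "." operator itself did nothing wrong in either case; the only difference is whether the input was decoded before it was used.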