Re^7: Converting Unicode

As a linguist, I've worked with various languages including Arabic, Chinese, and Tamil. We processed corpora in those languages in Perl, and we even built a treebank annotation tool in Tk. We never had problems with Unicode. 🤷🏽

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]


Replies are listed 'Best First'.
Re^8: Converting Unicode by Polyglot (Chaplain) on Dec 19, 2023 at 18:48 UTC
If you never had problems, it's because you were experienced enough to stay on top of things rather than merely letting Perl do its thing.

"Perl uses UTF-8 only when it thinks it is beneficial, so if all the characters in your string are in the range 0..255, there's a good chance the characters are all packed in bytes--but in the absence of other knowledge, you can't be sure because Perl converts between fixed 8-bit characters and variable-length UTF-8 characters as necessary." (Programming Perl, p. 403)

The "as necessary" is not necessarily as you might wish, as those less experienced quickly learn the hard way. Even the experienced, facing more complex requirements (just working with Chinese is not necessarily complex--it depends on the workflow and the forms of I/O required), often find hidden "gotchas," such as with locales, filenames, databases, incorporating other Perl modules, etc.

Blessings,

~Polyglot~
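
For anyone who wants to see the internal switch described in that quote, here is a minimal sketch. It assumes a reasonably modern Perl with the core Encode module; the sample strings and the helper name internal_flag are made up for illustration and come from nobody's real code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: reports whether Perl currently stores the string
# in its variable-length UTF-8 internal form (1) or as 8-bit bytes (0).
sub internal_flag { return utf8::is_utf8($_[0]) ? 1 : 0 }

# All characters are in the range 0..255, so Perl can pack them in bytes.
my $native = "caf\x{e9}";

# Same characters, but forced into the UTF-8 internal representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# Appending a character above 255 makes Perl upgrade on its own.
my $mixed = $native . "\x{4e2d}";

printf "native:   flag=%d length=%d\n", internal_flag($native),   length $native;
printf "upgraded: flag=%d length=%d\n", internal_flag($upgraded), length $upgraded;
printf "mixed:    flag=%d length=%d\n", internal_flag($mixed),    length $mixed;

# The two representations are the same string as far as Perl is concerned.
print "native eq upgraded\n" if $native eq $upgraded;

# Encoding explicitly at the I/O boundary gives the same octets either way.
my $octets_native   = encode('UTF-8', $native);
my $octets_upgraded = encode('UTF-8', $upgraded);
print "same octets after explicit encode\n" if $octets_native eq $octets_upgraded;
```

Both four-character strings compare equal and differ only in the internal flag, which is why code that encodes and decodes explicitly at its boundaries tends not to notice the switch, while code that passes strings straight to byte-oriented interfaces can.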
