Re: Trying to understand behavior of split and perl in general with UTF-8

You're on the right track. "use utf8" only tells perl that your source code itself contains UTF-8 characters. You also need "use Encode", and then you need to decode what you read and encode what you write.
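To make that concrete, here is a minimal sketch of the decode/split/encode round trip. The input bytes and the variable names are made up for illustration; in a real script the bytes would come from a file or STDIN rather than a literal.

```perl
use strict;
use warnings;
use utf8;                          # this source file itself contains UTF-8 literals
use Encode qw(decode encode);

# Hypothetical input: the raw UTF-8 bytes for "añb", as read from a file or pipe.
my $bytes = "a\xC3\xB1b";

# Decode bytes into characters before any string operation.
my $text = decode('UTF-8', $bytes);

my @byte_parts = split //, $bytes;   # 4 pieces: the "ñ" is torn into two bytes
my @char_parts = split //, $text;    # 3 pieces: a, ñ, b

# "use utf8" makes this literal a character string, so it matches the decoded input.
my $same = $text eq "añb" ? "yes" : "no";

# Encode characters back into bytes on the way out.
my $out = encode('UTF-8', join('|', @char_parts));
print scalar(@byte_parts), " ", scalar(@char_parts), " $same $out\n";
```

The point is that split (like length, substr and regexes) works on whatever units the string holds, so it only sees whole characters once the input has been decoded.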

Follow up by reading these:

perldoc perluniintro
perldoc perlunitut


Re^2: Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev (Novice) on Jun 17, 2010 at 20:20 UTC
Ah, I think I know where I went off the cliff now. "use utf8" is only for the source itself, not for any character streams coming in or going out. For those I have to state the encoding explicitly. It seems I did not truly realize that perl has an internal way of using UTF-8 that has nothing to do with what encoding the shell is using. Am I correct in this assumption? With what you and almut told me I got the example code working in a few seconds, so that gives me some hope that I understand things at least somewhat better now. Thanks very much for helping me renew my mastery of perl!
Re^3: Trying to understand behavior of split and perl in general with UTF-8 by almut (Canon) on Jun 17, 2010 at 20:42 UTC
"It seems I did not truly realize that perl has an internal way of using UTF-8 that has nothing to do with what encoding the shell is using."

Perl's internal way of representing Unicode characters is (almost) UTF-8, too. But for most practical purposes from the user perspective, it helps to ignore this implementation detail¹, and just properly decode your inputs and encode your outputs.

¹ Other languages have chosen different internal formats for Unicode strings, e.g. Python uses UCS-2 or UCS-4 (build-time option).
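One way to do that decoding and encoding without calling Encode by hand is to put an :encoding layer on the filehandle. The sketch below uses in-memory filehandles so that it is self-contained; the sample data and variable names are hypothetical, and with a real file you would pass a path instead of a scalar reference.

```perl
use strict;
use warnings;

# Hypothetical data standing in for a file on disk: "é;ü" as raw UTF-8 bytes.
my $raw = "\xC3\xA9;\xC3\xBC\n";

# The :encoding layer decodes while reading, independent of the shell's locale.
open(my $in, '<:encoding(UTF-8)', \$raw) or die "cannot open input: $!";
my $line = <$in>;
close $in;
chomp $line;

my @fields = split /;/, $line;     # two fields, one character each

# The same layer encodes while writing.
my $written = '';
open(my $out, '>:encoding(UTF-8)', \$written) or die "cannot open output: $!";
print {$out} join(',', @fields), "\n";
close $out;

print "fields: ", scalar(@fields), ", first field length: ", length($fields[0]), "\n";
```

For STDIN and STDOUT the equivalent would be binmode with the same layer, e.g. binmode(STDOUT, ':encoding(UTF-8)').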
Re^4: Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev (Novice) on Jun 18, 2010 at 11:02 UTC
I've been rereading the tuts and have experimented a bit with encodings last evening and things are working as I would expect them to (after your help). Hopefully that means I do grasp the basics better now. Thanks for trying to make me a little bit more knowledgeable.
Re^3: Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev (Novice) on Jul 22, 2010 at 12:16 UTC
Sorry it took so long to get back to this. Holidays... Anyway, just to let you know: yesterday the person I wrote the script for successfully converted about 2 GB worth of plain-text files with old research data to a new database format. Thanks to your help it all went smoothly.