Can someone point me to a comprehensive guide to encodings?
I'm still looking for that one, too. Don't expect too much, though; as far as I can tell, the biggest problem is the variety of forms in which encoding issues show up.
It isn't a Perl problem, it isn't a system/OS/libs problem, it isn't a setup problem, it isn't an I/O problem: it is more than a bit of all of these.
Even once your Perl skills are sharp enough to know what's what, you might still be surprised at times.
Make sure you use a Unicode editor when looking at code or data that should contain Unicode characters; I'm not sure Vim is your best friend here. Take care: if you see correct Cyrillic characters when editing, this might already be a sign that the data is _not_ Unicode. This has been the biggest issue for me personally so far: being sure about what is actually in the file/data, as opposed to how it looks (in editors, web pages, files, etc.).
And be prepared: other people are confused too, and it is not unusual to get files/mails/HTML pages where the 'announced' encoding is one thing, whereas the actual content is a real soup of Unicode and non-Unicode characters.
I hope, oh, I really really hope that all these headaches will disappear as soon as virtually everyone and everything lines up behind Unicode. But I'm afraid that will still take quite a while.
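One way to check what is really in the data, rather than how it looks, is to try a strict decode of the raw bytes. The following is only a minimal sketch using the core Encode module; the helper name looks_like_utf8 and the two sample byte strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# The same Russian word stored in two different encodings (sample data).
my $utf8_bytes   = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"; # UTF-8
my $cp1251_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";                         # CP1251

for my $sample ([utf8 => $utf8_bytes], [cp1251 => $cp1251_bytes]) {
    my ($label, $bytes) = @$sample;
    printf "%-7s %s\n", $label,
        looks_like_utf8($bytes) ? 'decodes as UTF-8' : 'is not valid UTF-8';
}
```

Keep in mind that this only proves well-formedness: plain ASCII also passes, and a failed decode does not tell you which of the single-byte encodings the data actually uses.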
I haven't run into any problems with non-Unicode editors for… five years or so. Maybe it's just that I don't view some problems as such; one stops noticing these things. There are at least 5 occasionally-used single-byte encodings for Russian.
I built a more-or-less complete Unicode toolchain several years ago.
IIRC, Perl 5.8.0 used to set STDIN and STDOUT to UTF-8 when $ENV{LANG} named a UTF-8 locale (like en_US.UTF-8), but that caused too many backward-compatibility problems, so now you have to explicitly set the input and output encodings.
Note that use utf8 only sets the encoding of the script. It has no influence on the input/output encodings.
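To make the distinction concrete, here is a small sketch (not taken from the original poster's script) that declares the source encoding with use utf8 and, separately, sets the I/O layers with binmode. The variable $greeting is just an illustrative name.

```perl
use strict;
use warnings;
use utf8;    # only says: this source file is written in UTF-8

# I/O encodings have to be set separately, per filehandle.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

my $greeting = "Привет";            # a character string thanks to "use utf8"
print length($greeting), "\n";      # counts characters, not bytes
print "$greeting\n";                # encoded to UTF-8 by the STDOUT layer
```

Without the binmode calls, printing $greeting would typically trigger a "Wide character in print" warning.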
Just for the sake of completeness: you could also have used the
command-line switch -C. In your case, to set the UTF-8-ness of
the filehandles STDIN and STDOUT:
#!/usr/bin/perl -CIO
use utf8; # the script's encoding, i.e. literal strings, regexes
# ...
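If you would rather keep this inside the script than on the #! line, the open pragma can do something similar. This is a sketch of an alternative and not an exact equivalent of -CIO: with :std it also covers STDERR and sets the default layers for filehandles opened in its lexical scope. PerlIO::get_layers is used here only to show what ended up on the handles.

```perl
use strict;
use warnings;
use utf8;                               # the script's own encoding
use open qw(:std :encoding(UTF-8));     # STDIN, STDOUT, STDERR and later open() calls

# Show which PerlIO layers are now active on the standard handles.
for my $pair ([STDIN => \*STDIN], [STDOUT => \*STDOUT]) {
    my ($name, $fh) = @$pair;
    my @layers = PerlIO::get_layers($fh);
    print "$name: @layers\n";
}
```

The exact list of layers printed depends on your platform and Perl build, so treat the output as diagnostic only.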