I/O encoding question

akho has asked for the wisdom of the Perl Monks concerning the following question:

Most Revered Monks,

I have a severe misunderstanding; how do encodings work in Perl?

I have a script that takes text as input and basically performs several substitutions on it. The text itself contains Cyrillic characters; some substitutions contain non-ASCII characters.

Actually, this script reproduces the problem:

use utf8;

while (<>) {
        s/<</«/g;
        print;
}

If I run it on the console or redirect its output to a file, it keeps the Cyrillic characters but messes up the « (replacing it with �). But if I run it on a block in Vim, it does the right substitution but destroys the Cyrillic characters (replacing each of them with two unreadable bytes, obviously).

What should I do? Can someone point me to a comprehensive guide to encodings?


Re: I/O encoding question by ysth (Canon) on May 06, 2007 at 18:47 UTC
As a starting point, see perluniintro.
Re: I/O encoding question by shmem (Chancellor) on May 06, 2007 at 18:59 UTC
Have a look at tlu -- TransLiterate Unicode. Worth studying.

--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: I/O encoding question by Krambambuli (Curate) on May 06, 2007 at 19:15 UTC
"Can someone point me to a comprehensive guide to encodings?"

I'm still looking for that one, too. Nevertheless, don't expect too much; the biggest problem, as far as I can tell, is the variety of forms in which the related issues show up. It isn't a Perl problem, it isn't a system/OS/libs problem, it isn't a setup problem, it isn't an I/O problem: it is more than a bit of all of these. Even once you have sharpened your Perl skills enough to know what's what, you might still be surprised at times.

Make sure you use a Unicode editor when looking at code or data that should contain Unicode characters; I'm not sure Vim is your best friend here. Take care: if you see correct Cyrillic characters when editing, this might already be a sign that the data is _not_ Unicode. This has proved to be the biggest issue for me personally so far: being sure about what is actually in the file or data, as opposed to how it looks (in editors, web pages, files, etc.).

And be prepared: other people are confused too, and it is not unusual to get files, mails, or HTML pages where the 'announced' encoding is one thing, whereas the actual content is a real soup of Unicode and non-Unicode characters.

I hope, oh, I really really hope that all these headaches will disappear as soon as virtually everyone and everything lines up behind Unicode. But I'm afraid that will still take quite a while.
Re^2: I/O encoding question by akho (Hermit) on May 06, 2007 at 21:00 UTC
Oh well. I haven't run into any problems with non-Unicode editors for… five years or so. Maybe it's just that I don't view some problems as such; one stops noticing these things. There are at least 5 occasionally used single-byte encodings for Russian. I built a more-or-less complete Unicode toolchain several years ago.

My problem was fixed with

        binmode(STDOUT,':utf8');
        binmode(STDIN,':utf8');

I actually expected this to be the default.
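For reference, a minimal sketch of the original script with that fix applied. It assumes the script file itself is saved as UTF-8 and that both input and output should be UTF-8. It deviates from the thread in three labeled ways: it uses the stricter `:encoding(UTF-8)` layer instead of `:utf8` (the strict layer validates the incoming bytes), it reads STDIN explicitly rather than `<>`, because `binmode(STDIN, ...)` does not touch files named on the command line, and `fix_quotes` is a hypothetical helper name introduced only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source itself is UTF-8, so the literal « is one character

# Decode bytes arriving on STDIN and encode characters leaving on STDOUT.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# fix_quotes is a hypothetical helper name, not taken from the thread.
sub fix_quotes {
    my ($line) = @_;
    $line =~ s/<</«/g;    # replace every "<<" with a left guillemet
    return $line;
}

while (my $line = <STDIN>) {
    print fix_quotes($line);
}
```

With both layers in place, the Cyrillic input is decoded into characters on the way in, and every character, including the «, is encoded back to UTF-8 on the way out.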
Re^3: I/O encoding question by Joost (Canon) on May 06, 2007 at 21:31 UTC
IIRC perl 5.8.0 used to set STDIN & STDOUT to Unicode when you had $ENV{LANG} set to a Unicode language/encoding (like en_US.UTF-8), but that caused too many problems with backward compatibility, so now you have to explicitly set the input and output encodings.

Note that `use utf8` only sets the encoding of the script. It has no influence on the input/output encodings.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
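The split between script encoding and I/O encoding can be made visible with a small sketch. It assumes the file is saved as UTF-8; `hex_bytes` is a hypothetical display helper, and the Encode module is used here only to show the bytes each encoding would produce. With `use utf8`, the literal « is a single character (U+00AB). Printed to a handle without an encoding layer, that character goes out as the single byte AB, which is not valid UTF-8 on its own, so a UTF-8 console shows � in its place.

```perl
use strict;
use warnings;
use utf8;                       # decode the literals in this source file
use Encode qw(encode);

# hex_bytes is a hypothetical helper, used for display only.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $char         = '«';                          # one character, U+00AB
my $latin1_bytes = encode('ISO-8859-1', $char);  # what a print without a layer emits
my $utf8_bytes   = encode('UTF-8', $char);       # what a UTF-8 terminal expects

printf "characters:    %d\n", length $char;
printf "Latin-1 bytes: %s\n", hex_bytes($latin1_bytes);
printf "UTF-8 bytes:   %s\n", hex_bytes($utf8_bytes);
```

This reports 1 character, the single byte AB for Latin-1, and the two bytes C2 AB for UTF-8.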
Re^3: I/O encoding question by almut (Canon) on May 06, 2007 at 22:35 UTC
Just for the sake of completeness: you could also have used the command-line switch `-C`. In your case, to set the UTF-8-ness of the filehandles STDIN and STDOUT:

        #!/usr/bin/perl -CIO
        use utf8;  # the script's encoding, i.e. literal strings, regexes
        # ...

(Or, instead of `-CIO`, you could have written `-C3`, if you like it shorter.) See perlrun for the details.
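An in-script alternative, not mentioned in the thread, is the `open` pragma. It may be worth knowing because perlrun notes that, since perl 5.10.1, a `-C` on the `#!` line must also be given on the command line. The sketch below assumes UTF-8 for all three standard handles; note that `:std` also covers STDERR, which goes slightly further than `-CIO`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the script's own encoding
use open qw(:encoding(UTF-8) :std);    # STDIN, STDOUT, STDERR, and later open() calls

while (my $line = <STDIN>) {
    $line =~ s/<</«/g;
    print $line;
}
```

Because the pragma takes effect at compile time, no separate `binmode` calls are needed in the body of the script.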