akho has asked for the wisdom of the Perl Monks concerning the following question:

Most Revered Monks,

I have a severe misunderstanding; how do encodings work in Perl?

I have a script that takes text as input and basically does several substitutions on it. The text itself contains Cyrillic characters; some substitutions have non-ASCII characters.

Actually, this script reproduces the problem:

use utf8;
while (<>) {
    s/<</«/g;
    print;
}

If I try to use it on the console or redirect its output to a file, it keeps Cyrillic characters but messes up the « (replaces it with �). But if I run it on a block in Vim, it does the right substitution but destroys Cyrillic characters (replacing each of them with two unreadable bytes, obviously).

What should I do? Can someone point me to a comprehensive guide to encodings?

Re: I/O encoding question
by ysth (Canon) on May 06, 2007 at 18:47 UTC
Re: I/O encoding question
by shmem (Chancellor) on May 06, 2007 at 18:59 UTC
    Have a look at tlu -- TransLiterate Unicode. Worth studying.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: I/O encoding question
by Krambambuli (Curate) on May 06, 2007 at 19:15 UTC
    Can someone point me to a comprehensive guide to encodings?

    I'm still looking for that one, too. Nevertheless, don't expect too much; the biggest problem for me, as far as I can tell, is the variety of forms in which the related issues show up.

    It isn't a Perl problem, it isn't a system/OS/libs problem, it isn't a setup problem, it isn't an I/O problem: it is more than a bit of all of these.

    Once you get your Perl skills sharpened to know what's what, you might still be surprised at times.

    Make sure you use a Unicode editor when looking at code or data that should contain Unicode characters; I'm not sure Vim is your best friend here. Take care: if you see correct Cyrillic chars when editing, this might already be a sign that the data is _not_ Unicode. This proved to be the biggest issue for me personally so far: to be sure about what is in the file/data as opposed to how it looks (in editors, web pages, files, etc.).

    And be prepared: other people are confused too, and it is not unusual to get files/mails/HTML pages where the 'announced' encoding is one thing, whereas the actual content is a real soup of Unicode and non-Unicode characters.

    I hope, oh, I really really hope that all these headaches will disappear as soon as virtually everyone and everything lines up behind Unicode. But I'm afraid that will still take quite a while.

      Oh well.

      I haven't run into any problems with non-unicode editors for… five years or so. Maybe it's just that I don't view some problems as such; one stops noticing these things. There are at least 5 occasionally-used single-byte encodings for Russian.

      I built a more-or-less complete Unicode toolchain several years ago.

      My problem was fixed with

      binmode(STDOUT, ':utf8');
      binmode(STDIN, ':utf8');

      I actually expected this to be the default.
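The effect of those two binmode calls can be sketched without touching the real STDIN and STDOUT. In this minimal sketch, in-memory filehandles stand in for the standard streams, and filter_bytes is a hypothetical helper name, not something from the thread. It uses :encoding(UTF-8), a stricter relative of :utf8 that also validates the input; that choice is an assumption, and the fix above works with :utf8 as posted.

```perl
use strict;
use warnings;
use utf8;    # the source is UTF-8, so the literal « below is one character

# Hypothetical helper: run the substitution through a decoding input layer
# and an encoding output layer. Takes raw UTF-8 bytes, returns raw UTF-8 bytes.
sub filter_bytes {
    my ($in_bytes) = @_;
    my $out_bytes = '';

    # In-memory handles stand in for STDIN and STDOUT.
    open(my $in,  '<:encoding(UTF-8)', \$in_bytes)  or die "open in: $!";
    open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "open out: $!";

    while (my $line = <$in>) {
        $line =~ s/<</«/g;    # works on decoded characters, not bytes
        print {$out} $line;
    }
    close($out) or die "close out: $!";
    close($in);
    return $out_bytes;
}

# "<<" followed by a Cyrillic word, written out as UTF-8 bytes
my $input  = "<<\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n";
my $result = filter_bytes($input);

# Show the output bytes in hex; the Cyrillic bytes should pass through unchanged.
print join(' ', map { sprintf '%02X', ord($_) } split //, $result), "\n";
```

With both layers in place, the Cyrillic input is decoded to characters on the way in and everything, including the «, is encoded back to UTF-8 on the way out.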

        IIRC perl 5.8.0 used to set STDIN & STDOUT to UTF-8 when you had $ENV{LANG} set to a Unicode locale (like en_US.UTF-8), but that caused too many problems with backward compatibility, so now you have to explicitly set the output and input encoding.

        Note that use utf8 only sets the encoding of the script. It has no influence on the input/output encodings.
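A small sketch of that distinction, assuming the core Encode module: use utf8 makes the literal « a single character inside the program, but the bytes that reach a filehandle depend on how the character is encoded on output. The hex_bytes helper is a hypothetical name used only for display. As far as I can tell, the single Latin-1 byte is what a handle with no encoding layer emits for this character, which would explain the � on a UTF-8 console.

```perl
use strict;
use warnings;
use utf8;                  # only affects how this source file is read
use Encode qw(encode);

# Hypothetical display helper: space-separated hex of each byte in a string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $quote = '«';           # one character, U+00AB, thanks to "use utf8"

my $as_utf8   = encode('UTF-8',      $quote);   # what a :utf8 handle writes
my $as_latin1 = encode('ISO-8859-1', $quote);   # a single byte, not valid UTF-8 on its own

print "characters:    ", length($quote),        "\n";   # 1
print "UTF-8 bytes:   ", hex_bytes($as_utf8),   "\n";   # C2 AB
print "Latin-1 bytes: ", hex_bytes($as_latin1), "\n";   # AB
```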

        Just for the sake of completeness: you could also have used the command-line switch -C. In your case, to set the UTF-8-ness of the filehandles STDIN and STDOUT:

        #!/usr/bin/perl -CIO
        use utf8; # the script's encoding, i.e. literal strings, regexes
        # ...

        (or, instead of -CIO, you could have written -C3, if you like it shorter...)  See perlrun for the details.
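One way to check whether -C took effect is to inspect the layers on the standard handles. This is a minimal sketch: has_utf8_layer is a hypothetical helper, while PerlIO::get_layers and the ${^UNICODE} variable are built in. Note that, according to perlrun, newer perls may require -C to be given on the command line as well (for example perl -CIO script.pl) rather than only on the #! line.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle has a utf8 layer applied.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

# ${^UNICODE} holds the numeric value of the -C switch (3 for -CIO).
print "-C setting:  ", ${^UNICODE}, "\n";
print "STDIN  utf8: ", (has_utf8_layer(\*STDIN)  ? "yes" : "no"), "\n";
print "STDOUT utf8: ", (has_utf8_layer(\*STDOUT) ? "yes" : "no"), "\n";
```

Running it once as perl script.pl and once as perl -CIO script.pl should show the difference between the two settings.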