Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am getting a strange result in a script that uses UTF-8 encoded Russian characters. If I do it this way, it works:

#!/usr/bin/perl
use strict;
use warnings;

my (@kmap, @rmap, %map);
{
    no warnings;    # no need to yell
    @kmap = qw/A B V G D E + J Z I Y K L M N O P R S T U F H C X ! @ # $ % ^ & * a b v g d e = j z i y k l m n o p r s t u f h c x 1 2 3 4 5 6 7 8/;
    @rmap = qw/А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я/;
}
@map{@kmap} = @rmap;

my $c;
for (@kmap) {
    print "$_: $map{$_}\t";
    print "\n" unless ++$c % 9;
}

Output:

A: А  B: Б  V: В  G: Г  D: Д  E: Е  +: Ё  J: Ж  Z: З
I: И  Y: Й  K: К  L: Л  M: М  N: Н  O: О  P: П  R: Р
S: С  T: Т  U: У  F: Ф  H: Х  C: Ц  X: Ч  !: Ш  @: Щ
#: Ъ  $: Ы  %: Ь  ^: Э  &: Ю  *: Я  a: а  b: б  v: в
g: г  d: д  e: е  =: ё  j: ж  z: з  i: и  y: й  k: к
l: л  m: м  n: н  o: о  p: п  r: р  s: с  t: т  u: у
f: ф  h: х  c: ц  x: ч  1: ш  2: щ  3: ъ  4: ы  5: ь
6: э  7: ю  8: я

But, if I do it this way:

#!/usr/bin/perl
use strict;
use warnings;

my (@kmap, @rmap, %map);
{
    no warnings;    # no need to yell
    @kmap = split //, 'ABVGDE+JZIYKLMNOPRSTUFHCX!@#$%^&*abvgde=jziyklmnoprstufhcx12345678';
    @rmap = split //, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя';
}
@map{@kmap} = @rmap;

my $c;
for (@kmap) {
    print "$_: $map{$_}\t";
    print "\n" unless ++$c % 9;
}

then the output looks like this:

A: �B: � V: �G: � D: �E: � +: �J: � Z: �
I: � Y: �K: � L: �M: � N: �O: � P: �R: �
S: �T: � U: �F: � H: �C: � X: �!: � @: �
#: � $: �%: � ^: �&: � *: �a: � b: �v: �
g: �d: � e: �=: � j: �z: � i: �y: � k: �
l: � m: �n: � o: �p: � r: �s: � t: �u: �
f: �h: � c: �x: � 1: �2: � 3: �4: � 5: �
6: � 7: �8: �

Why the difference?

(I see from preview that it is not showing the characters. The first output is English and Russian mapping; the second is English and question marks in ovals instead of Russian letters, and the output is not well aligned.)

Re: Different behaviour in characters in string vs. array?
by ikegami (Patriarch) on Dec 10, 2008 at 21:04 UTC
    You tell us the bytes are UTF-8 encoded characters, but you tell Perl they're iso-latin-1. Adding use utf8; will help. That tells Perl the source is encoded using UTF-8 rather than iso-latin-1.
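A minimal sketch of the effect described above, assuming the script file itself is saved as UTF-8. The variable names are only for illustration. Because use utf8 is lexically scoped, the same literal can be read both ways in one file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Without "use utf8", Perl reads the literal as raw bytes:
# each Cyrillic letter occupies two bytes in UTF-8.
my @bytes = split //, 'АБВ';

my @chars;
{
    use utf8;    # lexically scoped: applies only inside this block
    @chars = split //, 'АБВ';
}

printf "without use utf8: %d elements\n", scalar @bytes;    # 6
printf "with use utf8:    %d elements\n", scalar @chars;    # 3
```

The counts in the comments hold only under the stated assumption about the file encoding.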

      Yes, I know about UTF-8 (and also about 'encoding') - I'm sorry I didn't say it like this. I would like to know why it is different between string and list. That is confusing to me.

        perldoc -f split
        A pattern matching the null string (not to be confused with a null pattern // , which is just one member of the set of patterns matching a null string) will split the value of EXPR into separate characters at each point it matches that way.

        Perl treats each byte as one character unless you tell it the source is UTF-8, so split // breaks every two-byte Russian character into two single-byte characters.
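To make that concrete, here is a small sketch (not the OP's code) that uses explicit byte escapes so it does not depend on how the file is saved. It also shows what the hash slice does when the value list is twice as long as the key list: the surplus values are dropped, so each key ends up with half of a letter. The names %broken and %fixed are made up for the example, and decoding with Encode is one possible way to get characters, not the only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my @keys  = qw(A B);
my $bytes = "\xD0\x90\xD0\x91";    # UTF-8 encoding of U+0410 U+0411

# Undecoded: split // yields four one-byte strings.
my @halves = split //, $bytes;
my %broken;
@broken{@keys} = @halves;          # the two surplus values are dropped

# Decoded first: split // yields two one-character strings.
my @letters = split //, decode('UTF-8', $bytes);
my %fixed;
@fixed{@keys} = @letters;

printf "broken: A => %02X, B => %02X\n",
    ord($broken{A}), ord($broken{B});    # D0, 90
printf "fixed:  A => U+%04X, B => U+%04X\n",
    ord($fixed{A}), ord($fixed{B});      # U+0410, U+0411
```

A lone byte such as 0xD0 is not valid UTF-8 on its own, which is consistent with the terminal showing a replacement character for every value.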

        I would like to know why it is different between string and list

        qw() does split ' ' (separates words), not split // (separates characters). Had you used the former, you would have gotten the same result.

        You might think those two are the same in this case, but they're not, because of the bug I identified.
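A short sketch of the difference between the two kinds of split on the same undecoded bytes. The input is a stand-in for a two-word qw list, written with byte escapes so the file encoding does not matter. The ASCII space byte 0x20 never occurs inside a multi-byte UTF-8 sequence, so splitting on it keeps each letter's bytes together, while split // cuts between every byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bytes = "\xD0\x90 \xD0\x91";    # UTF-8 bytes of U+0410, a space, U+0411

my @words = split ' ', $bytes;      # cuts at the space only
my @units = split //,  $bytes;      # cuts between every byte

printf "split ' ': %d elements of %d bytes each\n",
    scalar @words, length $words[0];    # 2 elements of 2 bytes each
printf "split //:  %d elements of %d byte each\n",
    scalar @units, length $units[0];    # 5 elements of 1 byte each
```

The elements from split ' ' are still bytes rather than characters, but they are complete UTF-8 sequences, so they display correctly when printed unchanged to a UTF-8 terminal.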

Re: Different behaviour in characters in string vs. array?
by ccn (Vicar) on Dec 10, 2008 at 21:22 UTC
      encoding has problems, which is why I recommended use utf8;

        Good point.

        So, to avoid the bugs in encoding and the "Wide character in print" warnings on output, one can write:

        use utf8;
        $ENV{LANG} = 'ru_RU.UTF-8';
        use open IO => ':locale'; # I tried ':utf8' instead but got warnings
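For comparison, a sketch of a different technique from the one posted above: putting an explicit :encoding(UTF-8) layer on the output handle with binmode, which does not depend on the locale. The helper encoded_bytes is hypothetical and exists only to show which bytes the layer writes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print $text through a UTF-8 encoding layer into an
# in-memory file and return the bytes that were written.
sub encoded_bytes {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buffer;
}

# The same layer on STDOUT encodes characters on the way out, so print
# no longer warns about wide characters.
binmode STDOUT, ':encoding(UTF-8)';
print "A: \x{410}\n";

my $bytes = encoded_bytes("\x{410}");
printf "U+0410 is written as %d bytes\n", length $bytes;    # 2
```

Whether this or the locale-based approach fits better depends on whether the terminal can be assumed to use UTF-8.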

      Removing the "no warnings" would produce a warning. There's a '#' inside the 'qw//', and Perl gets a bit unhappy about those.

      ben@Tyr:~$ perl -wle'print qw/#/'
      Possible attempt to put comments in qw() list at -e line 1.
      #

      --
      "Language shapes the way we think, and determines what we can think about."
      -- B. L. Whorf
        In fact, there are no # characters in the Russian list. They showed up in the OP's post because PM cannot display Unicode characters inside a <code> block and shows them as &#NNNN; entities instead. The real code is:
        @kmap = qw/A B V G D E + J Z I Y K L M N O P R S T U F H C X ! @ # $ % ^ & * a b v g d e = j z i y k l m n o p r s t u f h c x 1 2 3 4 5 6 7 8/;
        @rmap = qw/А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я/; }
Re: Different behaviour in characters in string vs. array?
by woodpeaker (Novice) on Dec 10, 2008 at 22:09 UTC
    Strange. I resolved that problem with just 'use utf8;'. It works with GTK2 and console scripts.