Different behaviour in characters in string vs. array?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I get a strange result in my script, which uses UTF-8-encoded Russian characters. If I do it this way, it works:

#!/usr/bin/perl
use strict;
use warnings;

my (@kmap, @rmap, %map);
{
    no warnings;        # No need to yell

    @kmap = qw/A B V G D E + J Z I Y K L M N O P R S T U F H C X ! @ # $ % ^ & *
               a b v g d e = j z i y k l m n o p r s t u f h c x 1 2 3 4 5 6 7 8/;
    @rmap = qw/А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
               а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я/;
}
@map{@kmap} = @rmap;

my $c;
for (@kmap){
    print "$_: $map{$_}\t";
    print "\n" unless ++$c % 9;
}

Output:

A: А    B: Б    V: В    G: Г    D: Д    E: Е    +: Ё    J: Ж    Z: З
I: И    Y: Й    K: К    L: Л    M: М    N: Н    O: О    P: П    R: Р
S: С    T: Т    U: У    F: Ф    H: Х    C: Ц    X: Ч    !: Ш    @: Щ
#: Ъ    $: Ы    %: Ь    ^: Э    &: Ю    *: Я    a: а    b: б    v: в
g: г    d: д    e: е    =: ё    j: ж    z: з    i: и    y: й    k: к
l: л    m: м    n: н    o: о    p: п    r: р    s: с    t: т    u: у
f: ф    h: х    c: ц    x: ч    1: ш    2: щ    3: ъ    4: ы    5: ь
6: э    7: ю    8: я

But, if I do it this way:

#!/usr/bin/perl
use strict;
use warnings;

my (@kmap, @rmap, %map);
{
    no warnings;        # No need to yell

    @kmap = split //, 'ABVGDE+JZIYKLMNOPRSTUFHCX!@#$%^&*abvgde=jziyklmnoprstufhcx12345678';
    @rmap = split //, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя';
}
@map{@kmap} = @rmap;

my $c;
for (@kmap){
    print "$_: $map{$_}\t";
    print "\n" unless ++$c % 9;
}

then the output looks like this:

A: �B: �      V: �G: �  D: �E: �        +: �J: �        Z: �
I: �    Y: �K: �        L: �M: �        N: �O: �        P: �R: �
S: �T: �      U: �F: �  H: �C: �        X: �!: �        @: �
#: �    $: �%: �        ^: �&: �        *: �a: �        b: �v: �
g: �d: �      e: �=: �  j: �z: �        i: �y: �        k: �
l: �    m: �n: �        o: �p: �        r: �s: �        t: �u: �
f: �h: �      c: �x: �  1: �2: �        3: �4: �        5: �
6: �    7: �8: �

Why the difference?

(I see from preview that it is not showing the characters. The first output is English and Russian mapping; the second is English and question marks in ovals instead of Russian letters, and the output is not well aligned.)

Re: Different behaviour in characters in string vs. array? by ikegami (Patriarch) on Dec 10, 2008 at 21:04 UTC
You tell us the bytes are UTF-8 encoded characters, but you tell Perl they're iso-latin-1. Adding `use utf8;` will help: it tells Perl that the source is encoded in UTF-8 rather than iso-latin-1.
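As a minimal sketch of the effect described here (not code from the thread), the same literal can be read with and without the `utf8` pragma. It assumes the script file itself is saved as UTF-8; the variable names `$bytes` and `$chars` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The utf8 pragma is lexically scoped, so the same literal can be
# compiled both ways in one file. Assumes this file is saved as UTF-8.
my $bytes = do { no utf8;  'АБВ' };   # read as six one-byte characters
my $chars = do { use utf8; 'АБВ' };   # read as three Cyrillic characters

printf "without utf8: length %d\n", length $bytes;
printf "with utf8:    length %d\n", length $chars;
```

Only the lengths are printed, so no output layer is needed to run it.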
Re^2: Different behaviour in characters in string vs. array? by Anonymous Monk on Dec 10, 2008 at 22:24 UTC
Yes, I know about UTF-8 (and also about 'encoding'). I'm sorry I didn't say so. What I would like to know is why a string behaves differently from a list. That is what confuses me.
Re^3: Different behaviour in characters in string vs. array? by ccn (Vicar) on Dec 10, 2008 at 22:35 UTC
perldoc -f split says:

"A pattern matching the null string (not to be confused with a null pattern //, which is just one member of the set of patterns matching a null string) will split the value of EXPR into separate characters at each point it matches that way."

Characters are one byte long unless you specify the utf8 encoding, so split breaks every two-byte Russian character into a pair of single-byte characters.
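A small sketch of this behaviour, under the assumption that the string holds raw UTF-8 bytes, as it does when `use utf8` is absent. The bytes are written as escapes so the example does not depend on how the file is saved; the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the two Cyrillic letters U+0410 and U+0411.
my $bytes = "\xD0\x90\xD0\x91";

# Splitting the undecoded string yields one element per byte.
my @byte_parts = split //, $bytes;

# Decoding first turns the bytes into characters, so split sees two of them.
my @char_parts = split //, decode('UTF-8', $bytes);

printf "undecoded: %d elements\n", scalar @byte_parts;
printf "decoded:   %d elements\n", scalar @char_parts;
```

This is also why, in the question, every key ends up paired with half a letter: the hash slice assigns one byte to each key.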
Re^3: Different behaviour in characters in string vs. array? by ikegami (Patriarch) on Dec 10, 2008 at 22:37 UTC
"I would like to know why it is different between string and list"

`qw()` does `split ' '` (separates words), not `split //` (separates characters). Had you used the former, you would have gotten the same result. You might think those two are the same in this case, but they're not, because of the bug I identified.
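The difference can be sketched on a string of raw UTF-8 bytes (an illustration with hypothetical variable names, not code from the thread). Splitting on whitespace leaves each two-byte letter in one piece, which is why the `qw` version printed correctly, while `split //` cuts between the bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three Cyrillic letters separated by spaces, as undecoded UTF-8 bytes.
my $bytes = "\xD0\x90 \xD0\x91 \xD0\x92";

my @words = split ' ', $bytes;   # word split: each letter's bytes stay together
my @chars = split //,  $bytes;   # per-byte split: letters are cut in half

printf "split ' ': %d elements, each %d bytes long\n",
    scalar @words, length $words[0];
printf "split //:  %d elements, each %d byte long\n",
    scalar @chars, length $chars[0];
```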
Re^4: Different behaviour in characters in string vs. array? by Anonymous Monk on Dec 10, 2008 at 23:27 UTC
Re: Different behaviour in characters in string vs. array? by ccn (Vicar) on Dec 10, 2008 at 21:22 UTC
Add a line `use encoding 'utf8';` and remove `no warnings;`. See Поддержка Unicode (Unicode Support) for explanations.
Re^2: Different behaviour in characters in string vs. array? by ikegami (Patriarch) on Dec 10, 2008 at 21:27 UTC
`encoding` has problems, which is why I recommended `use utf8;`.
Re^3: Different behaviour in characters in string vs. array? by ccn (Vicar) on Dec 10, 2008 at 21:45 UTC
Good point. So, to avoid the `encoding` bugs and the "Wide character in print" warnings on output, one can write:

use utf8;
$ENV{LANG} = 'ru_RU.UTF-8';
use open IO => ':locale';   # I tried ':utf8' instead but got warnings
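A different way to get the same effect on output, shown only as a sketch and not what the poster ran, is to set an explicit encoding layer on STDOUT with `binmode`. It assumes the script is saved as UTF-8 and that the terminal expects UTF-8; the three-entry `%map` is a cut-down stand-in for the full table.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # string literals below are UTF-8 text

# Encode characters to UTF-8 when printing, which avoids the
# "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';

my %map = (A => 'А', B => 'Б', V => 'В');
print "$_: $map{$_}\n" for sort keys %map;
```

With the layer in place each value is a single character inside Perl and is encoded only at the point of output.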
Re^4: Different behaviour in characters in string vs. array? by ikegami (Patriarch) on Dec 10, 2008 at 22:48 UTC
Re^2: Different behaviour in characters in string vs. array? by oko1 (Deacon) on Dec 10, 2008 at 22:27 UTC
Removing the "no warnings" would produce a warning. There's a '#' inside the 'qw//', and Perl gets a bit unhappy about those.

ben@Tyr:~$ perl -wle'print qw/#/'
Possible attempt to put comments in qw() list at -e line 1.
#

-- "Language shapes the way we think, and determines what we can think about." -- B. L. Whorf
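If the '#' in the list is intentional, one option is to disable only the `qw` warning category in a small scope instead of using a blanket `no warnings;`. This is a sketch with a hypothetical `@keys` array, not a change anyone in the thread proposed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @keys;
{
    # Silence only "Possible attempt to put comments in qw() list";
    # every other warning stays enabled inside this block.
    no warnings 'qw';
    @keys = qw/! @ # $ %/;
}

print scalar(@keys), " keys, third is '$keys[2]'\n";
```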
Re^3: Different behaviour in characters in string vs. array? by ccn (Vicar) on Dec 10, 2008 at 22:41 UTC
In fact there are no # characters in the code. They appear in the OP's post as part of HTML entities (&#1040; and so on), because PM cannot display Unicode characters inside a <code> block. The real code is:

@kmap = qw/A B V G D E + J Z I Y K L M N O P R S T U F H C X ! @ # $ % ^ & * a b v g d e = j z i y k l m n o p r s t u f h c x 1 2 3 4 5 6 7 8/;
@rmap = qw/А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я/;
}
Re^4: Different behaviour in characters in string vs. array? by ikegami (Patriarch) on Dec 10, 2008 at 22:50 UTC
Re^5: Different behaviour in characters in string vs. array? by ccn (Vicar) on Dec 10, 2008 at 23:09 UTC
Re: Different behaviour in characters in string vs. array? by woodpeaker (Novice) on Dec 10, 2008 at 22:09 UTC
Strange. I resolved that problem with `use utf8;` alone. It works with GTK2 and console scripts.