Unicode Problem

speedlight has asked for the wisdom of the Perl Monks concerning the following question:

My Partial Code:

$string =~ /(\X+)\s+(\X+)/;

print "String: $string\n";
print "Verify: $1 -- $2 \n";

Result:
String: 鹿聤赂拢鹿聟鹿聞鹿聢鹿聠 test
Verify: 脗鹿脗聤脗赂脗拢脗鹿脗聟脗鹿脗聞脗鹿脗聢脗鹿脗聠 -- test

The string is in Arabic characters. I can't figure out why $1 has an additional "脗" inserted. So far, only Greek, Hebrew and Arabic characters have this problem.

Below is another example, where the string is Chinese:

My Partial Code:

$string =~ /(\X+)\s+(\X+)/;
print "String: $string\n";
print "Verify: $1 -- $2 \n";

Result:
String: 盲赂掳莽聰掳忙卤陆猫陆娄 test
Verify: 盲赂掳莽聰掳忙卤陆猫陆娄 -- test

I have no problem with Chinese or Thai characters.

Does anyone know what the problem is? I am using Perl 5.8.6, which already supports Unicode.

Is something messing up the capture variables $1 and $2? Why does $1 not display properly when $string displays correctly?
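
One plausible reading of the stray characters, offered as an assumption rather than a confirmed diagnosis, is double encoding: bytes that are already UTF-8 get treated as Latin-1 characters and encoded to UTF-8 a second time. The minimal sketch below uses a hypothetical one-letter sample (not the poster's data) to show how that adds a 0xC2 byte, which displays as "Â" in Latin-1.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: the UTF-8 bytes of ARABIC LETTER ALEF (U+0627).
my $bytes = "\xD8\xA7";

# Treat each byte as a Latin-1 character, then encode to UTF-8 again.
# This mimics undecoded bytes passing through a UTF-8 output layer.
my $double = encode('UTF-8', decode('ISO-8859-1', $bytes));

# Print both byte sequences as hex so the extra bytes are visible.
printf "original bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "double-encoded: %s\n", join ' ', map { sprintf '%02X', ord } split //, $double;
```

The second line prints `C3 98 C2 A7`: each original byte has become two bytes, and the `C2` is the extra "Â".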

edit (broquaint): added <code> tags.


Re: Unicode Problem by halley (Prior) on Feb 03, 2005 at 18:10 UTC
(1) That looks like an encoding mismatch. Does your terminal support any Unicode encodings? If you're on a Unix platform, what are your `LC_` environment values? What terminal application are you using?
(2) Unicode is not an encoding. UTF-8 is an encoding. See "FMTYEWTK about Characters vs Bytes".
(3) Please read the notes on this site to learn how to wrap your code with tags. See "Writeup Formatting Tips".
-- `[ e d @ h a l l e y . c c ]`
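
The characters-versus-bytes point can be sketched as follows: decode the incoming bytes into characters before matching, and give STDOUT an explicit encoding layer so the output does not depend on the locale. This is a minimal sketch, not the poster's script; the sample bytes and the names `$raw`, `$first` and `$second` are hypothetical, and it assumes the input really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Encode character strings as UTF-8 on output, whatever the locale says.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical input: raw UTF-8 bytes for an Arabic word, a space, and "test".
my $raw = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85 test";

# Decode bytes into characters so that \X matches grapheme clusters, not bytes.
my $string = decode('UTF-8', $raw);

# Same pattern as in the question; captures are copied into named variables.
my ($first, $second) = $string =~ /(\X+)\s+(\X+)/;

print "String: $string\n";
print "Verify: $first -- $second\n";
```

Because `$string` and both captures are character strings and the only encoding step is the output layer, the two printed lines go through the same conversion.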
Re^2: Unicode Problem by speedlight (Initiate) on Feb 04, 2005 at 08:08 UTC
1) These are my environment values:

# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The Perl script is run by a background process. I have now realised that after changing /etc/sysconfig/i18n to set LC_CTYPE="en_US", I see the correct result when I run the script:

Wide character in print at ...
String: 抚箘辅腑斧 test
Verify: 抚箘辅腑斧 -- test

If I remove LC_CTYPE, this is the result:

String: 抚箘辅腑斧 test
Verify: 赂搂鹿聞赂篓赂颅赂芦 -- test

Another thing puzzles me: even with LC_CTYPE="en_US" configured, so that the output is correct when I run the script myself, the background process still gives the wrong result when it runs the Perl script. Is there a file for configuring LC_CTYPE system-wide?

I am not sure where the problem lies now. Is it my system? My environment? Or is it my script? The $string variable prints correctly, but the capture variable $1 is wrong. Did I miss something here?
Re^3: Unicode Problem by speedlight (Initiate) on Feb 04, 2005 at 09:15 UTC
Hi, I logged the background process's output to a file. In the Perl script, I added a command to log the locale environment:

LANG=POSIX
LC_CTYPE=POSIX
LC_NUMERIC=POSIX
LC_TIME=POSIX
LC_COLLATE=POSIX
LC_MONETARY=POSIX
LC_MESSAGES=POSIX
LC_PAPER=POSIX
LC_NAME=POSIX
LC_ADDRESS=POSIX
LC_TELEPHONE=POSIX
LC_MEASUREMENT=POSIX
LC_IDENTIFICATION=POSIX
LC_ALL=

How can I modify the locale settings for the background process?
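
A child process inherits its environment from whatever starts it, so one option is to set the locale variables in the launcher before the script runs. The sketch below is a hypothetical launcher, not the poster's setup: it logs the locale variables it inherited, then starts a child Perl process with `LC_ALL` overridden. It assumes a Unix-like system, and the `en_US.UTF-8` locale must be installed for the setting to have any effect beyond the variable itself.

```perl
use strict;
use warnings;

# Log the locale variables this process inherited. Under cron or an init
# script these are often unset or POSIX.
for my $name (qw(LANG LC_ALL LC_CTYPE)) {
    my $value = defined $ENV{$name} ? $ENV{$name} : '(unset)';
    print STDERR "$name=$value\n";
}

# Override LC_ALL for the child only; "local" restores the old value
# when the block ends, so the parent's environment is left untouched.
my $child_value;
{
    local $ENV{LC_ALL} = 'en_US.UTF-8';
    open(my $child, '-|', $^X, '-e', 'print $ENV{LC_ALL}')
        or die "cannot start child: $!";
    $child_value = <$child>;
    close $child;
}

print "child saw LC_ALL=$child_value\n";
```

In a real launcher the child command would be the actual script rather than the one-liner used here to show what the child sees.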
Re^4: Unicode Problem by fglock (Vicar) on Feb 04, 2005 at 14:01 UTC