speedlight has asked for the wisdom of the Perl Monks concerning the following question:

My Partial Code:
$string =~ /(\X+)\s+(\X+)/;
print "String: $string\n";
print "Verify: $1 -- $2 \n";

Result:
String: يأملون test
Verify: ¹Š¸£¹…¹„¹ˆ¹† -- test

The string is in Arabic characters. I can't figure out why $1 has an additional "Â" inserted. So far, only Greek, Hebrew and Arabic characters have this problem.

Below is another example:
This time the string is Chinese:

My Partial Code:
$string =~ /(\X+)\s+(\X+)/;
print "String: $string\n";
print "Verify: $1 -- $2 \n";

Result:
String: 丰田汽车 test
Verify: 丰田汽车 -- test

I have no problem with Chinese or Thai characters.

Does anyone know what the problem is? I am using Perl 5.8.6, which already supports Unicode.

Is something messing up the capture variables $1 and $2? How come $1 is not displayed properly while $string is displayed correctly?

edit (broquaint): added <code> tags.

Replies are listed 'Best First'.
Re: Unicode Problem
by halley (Prior) on Feb 03, 2005 at 18:10 UTC
    (1) That looks like an encoding mismatch. Does your terminal support any Unicode encodings? If you're on a Un*x platform, what are your LC_* environment values? What terminal application are you using?

    (2) Unicode is not an encoding. UTF-8 is an encoding. FMTYEWTK about Characters vs Bytes

    (3) Please read the notes on this site to learn how to wrap your code with tags. Writeup Formatting Tips
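
To illustrate point (2), here is a minimal sketch of keeping bytes and characters apart. It assumes the input arrives as raw UTF-8 bytes; the sample bytes and the helper name `split_pair` are hypothetical and not taken from the original script. The string is decoded once on input and encoded once on output, so `$chars` and the captures go through exactly the same conversion.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Split a decoded character string into two whitespace-separated parts,
# using the same pattern as the original post.
sub split_pair {
    my ($chars) = @_;
    my @parts = $chars =~ /(\X+)\s+(\X+)/;
    return @parts;    # empty list if the pattern did not match
}

# Hypothetical sample: raw UTF-8 bytes for a five-letter Arabic word,
# a space, and "test".
my $bytes = "\xD8\xA7\xD9\x84\xD8\xA8\xD8\xAD\xD8\xAB test";

# Decode once at the input boundary: bytes in, characters out.
my $chars = decode('UTF-8', $bytes);
my ($first, $second) = split_pair($chars);

# Encode once at the output boundary via an explicit I/O layer,
# instead of relying on whatever the locale implies.
binmode(STDOUT, ':encoding(UTF-8)');
print "String: $chars\n";
print "Verify: $first -- $second\n";
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

Whether this matches the actual cause of the extra "Â" depends on how `$string` is read in the real script, which the post does not show.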

    --
    [ e d @ h a l l e y . c c ]

      1) These are my environment values:
      # locale
      LANG=en_US.UTF-8
      LC_CTYPE="en_US.UTF-8"
      LC_NUMERIC="en_US.UTF-8"
      LC_TIME="en_US.UTF-8"
      LC_COLLATE="en_US.UTF-8"
      LC_MONETARY="en_US.UTF-8"
      LC_MESSAGES="en_US.UTF-8"
      LC_PAPER="en_US.UTF-8"
      LC_NAME="en_US.UTF-8"
      LC_ADDRESS="en_US.UTF-8"
      LC_TELEPHONE="en_US.UTF-8"
      LC_MEASUREMENT="en_US.UTF-8"
      LC_IDENTIFICATION="en_US.UTF-8"
      LC_ALL=

      I run the Perl script as a background process.

      I have now realised that after changing /etc/sysconfig/i18n to set LC_CTYPE="en_US", I see the correct result when I run the script:

      Wide character in print at ...
      String: البحث test
      Verify: البحث -- test

      If I remove LC_CTYPE, this is the result:

      String: البحث test
      Verify: ¸§¹„¸¨¸­¸« -- test

      Another thing puzzles me: with LC_CTYPE="en_US" configured, the output is correct when I run the script myself, but the background process that runs the same Perl script still gives the wrong result. Is there a file for configuring LC_CTYPE system-wide?

      I am not sure where the problem could be now.

      Is it my system or my environment?

      Or is it my script? The $string output is correct, but the capture variable $1 is wrong.
      Did I miss something here?
        Hi,
        I logged the background process to a file. In the Perl script, I added a command to log the locale environment:

        LANG=POSIX
        LC_CTYPE=POSIX
        LC_NUMERIC=POSIX
        LC_TIME=POSIX
        LC_COLLATE=POSIX
        LC_MONETARY=POSIX
        LC_MESSAGES=POSIX
        LC_PAPER=POSIX
        LC_NAME=POSIX
        LC_ADDRESS=POSIX
        LC_TELEPHONE=POSIX
        LC_MEASUREMENT=POSIX
        LC_IDENTIFICATION=POSIX
        LC_ALL=

        How can I modify the locale settings for the background process?
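
One possible approach, sketched here under assumptions, is to have whatever launches the script set the locale variables explicitly rather than inherit POSIX. The helper name `run_with_locale` is hypothetical, and the one-liner run through `$^X` only stands in for the real script. Setting the variable has an effect on character handling only if that locale is installed on the machine.

```perl
use strict;
use warnings;

# Run a command with an explicit locale, regardless of what the parent
# process inherited. Returns the command's standard output.
sub run_with_locale {
    my ($locale, @cmd) = @_;

    # local() restores the previous values when this sub returns,
    # so only the child process sees the override.
    local $ENV{LANG}   = $locale;
    local $ENV{LC_ALL} = $locale;

    # List-form pipe open: no shell is involved.
    open(my $pipe, '-|', @cmd) or die "cannot run @cmd: $!";
    my $output = do { local $/; <$pipe> };
    close($pipe) or die "command failed: @cmd";
    return $output;
}

# Hypothetical usage: $^X is the currently running perl binary.
my $seen = run_with_locale('en_US.UTF-8', $^X, '-e',
    'print "LC_ALL=$ENV{LC_ALL}\n"');
print $seen;
```

Alternatively, calling `binmode(STDOUT, ':encoding(UTF-8)')` inside the script makes its output encoding independent of the inherited locale.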