tilwani has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script that reads data from a table in MS SQL Server and saves it to a file. One column contains multibyte characters. When I set the LANG variable of the Unix shell to "UTF-8", each multibyte character is treated as a single character and the column length is correct. However, when I change LANG to "C", the column length increases, because each byte of a multibyte character is counted as a separate character. I suspect that Perl is not handling multibyte characters properly. Perl version: v5.16.3, OS: RHEL 7. Kindly help.

Re: multi byte character issue
by Tux (Canon) on Jan 03, 2020 at 11:01 UTC

    What driver do you use to connect to the database?

    There is a huge difference in behavior between MS ODBC and FreeTDS. See if this post is of any help.


    Enjoy, Have FUN! H.Merijn

      Also note that DBD::ODBC can and does support Unicode (including UTF-8), but ONLY if it was compiled to do so. By default, Unicode support is enabled on Windows but disabled on all other platforms. So if you need a DBD::ODBC with Unicode support on Linux or any other non-Windows platform, you have to tell DBD::ODBC's Makefile.PL by invoking it with the -u parameter (and then recompile). See also "Enabling and Disabling Unicode support" in DBD::ODBC.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: multi byte character issue
by haukex (Archbishop) on Jan 03, 2020 at 09:25 UTC

    I don't know how MS SQL might interact with the environment variables, but it would still be helpful if you could provide a short, runnable piece of code that reproduces the error on your end. Perhaps we can spot the issue in that (for example, do you "use locale;" or something similar?). Also, see my node here for tips on how to debug some Unicode issues. In this case in particular, you should use Devel::Peek's Dump to inspect the variable in question (and post that output here too).
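
    As a rough illustration of that kind of inspection, here is a minimal sketch. It does not touch the database: the sample value in $raw is made up and merely stands in for whatever the driver returns, on the assumption that the driver hands back undecoded UTF-8 bytes. It compares length() before and after decoding and passes both strings to Dump, whose output goes to STDERR.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

# Hypothetical sample value: "café", with the "é" (U+00E9) present as the
# two UTF-8 bytes C3 A9, as an undecoded driver result might look.
my $raw  = "caf\xC3\xA9";

# Decode the bytes into a Perl character string.
my $text = decode('UTF-8', $raw);

printf "raw length:     %d\n", length $raw;    # counts bytes
printf "decoded length: %d\n", length $text;   # counts characters

# Dump prints the internals to STDERR: compare the FLAGS line (look for
# UTF8) and the PV line of the two scalars.
Dump($raw);
Dump($text);
```

    If the column value from the real script looks like $raw in the Dump output, the data is arriving as bytes and length() will count bytes regardless of LANG.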

Re: multi byte character issue
by NERDVANA (Priest) on Jan 03, 2020 at 21:30 UTC
    As far as I know, perl ignores the LC environment variables. Is it possible that you have perl reading multibyte characters as multiple bytes and passing those bytes verbatim to the file, and then the LC variables are affecting how you view those bytes? If that’s the case, your perl program is producing the correct output (for the wrong reasons) but maybe you don’t really need to do anything about it.
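
    A small sketch of that scenario, assuming the script never decodes the data: the value in $bytes is invented for illustration, and an in-memory filehandle stands in for the output file. The bytes written are identical to the bytes read, and only the interpretation applied afterwards changes the count.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical column value: "naïve" as raw UTF-8 bytes ("ï" is C3 AF).
my $bytes = "na\xC3\xAFve";

# Write the bytes verbatim; an in-memory file stands in for the real one.
open my $out, '>:raw', \my $buffer or die "cannot open in-memory file: $!";
print {$out} $bytes;
close $out;

# The "file" holds exactly the bytes that were read.
print "bytes unchanged\n" if $buffer eq $bytes;

# The same content counted two ways: as bytes, and as UTF-8 characters.
my $as_bytes = length $buffer;
my $as_chars = length decode('UTF-8', $buffer);
print "as bytes: $as_bytes, as characters: $as_chars\n";
```

    A tool running under LANG=C would report the byte count and one running under a UTF-8 locale the character count, even though the file itself is the same in both cases.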