UTF-8, Oracle and Perl life

Akira71 has asked for the wisdom of the Perl Monks concerning the following question:

First I am new to Perl and I have searched everywhere for an answer to this question without much luck, but I thought I would try you enlightened people.

I have a new application done entirely in Java(groan) that works in Japanese for a major corporation. It turns out though, that the reporting structure for the company (extracts, formatting, etc..) is done entirely in Perl 5.6.0. This is what Perl is excellent at doing so it should be fine. Well the problem boils down to legacy system and I am not sure what to do about it. I have Perl 5.6.0, Oraperl module to connect to Perl and JCODE to deal with the Japanese encodings.

All should be well, but basically what I am finding out is that Perl, even with Use UTF8 does not use UTF8 well. We are extracting garbage out of the database that is consistent with the ASCII representation of UTF8 / ShiftJIS data. In otherwords, if you do not have the font, then this how that kanji character will look. Oracle 8i is set to run in UTF8 and it is preserving the encoding correctly and Java seems to deal with the data correctly (our application runs in Japanese) but Perl does nothing with this information. First if we write it out to a text file, all encoding is lost. I know there is a way to ensure that UTF8 is used for output in Perl 5.8 but we cannot upgrade. Second it actually appears that Oraperl is dropping the encoding information whenever we retrieve data. I am fairly certain of this because when we call JCODE to convert the character sets, the input is there, but then it does not tranform the characters and it ends up having an empty string.

What I am asking, is if anyone has experience using UTF8, Oracle 8i, Perl 5.6, Oraperl and in Japanese can shed some enlightenment my way. I am having a heck of a time.
Thank you very much in advance of any information,

Akira Yamashita

Comment on UTF-8, Oracle and Perl life

Replies are listed 'Best First'.
Re: UTF-8, Oracle and Perl life by sharkey (Scribe) on Oct 07, 2002 at 21:54 UTC
Your database may be using UTF8, but if your clients are anything like ours (windows), they are probably connecting with the client charater set set to: NLS_LANG=JAPANESE_JAPAN.JA16SJIS Which means the Oracle client is translating the characters to UTF8 when it sends them to the database and back to SJIS when it retrieves them. Windows still doesn't have a decent UTF8 IME for Japanese, at least according to our Japanese users. For my Japanese web interface, I connect to the database using the same NLS_LANG setting, and treat the data as binary data. Oracle gives me SJIS data, perl doesn't change it, and the clients get the SJIS characters they are expecting. If you really are using UTF8 on the client, you should be able to connect with your client character set as NLS_LANG=(whatever).UTF8 and get UTF8 from the database. But I would make darn sure that your clients really doing that. (Everyone here said we were using UTF8(), but when I checked the actual windows machines, they were using JA16SJIS, WE8ISO8859P1, and other local character sets() on the clients.) Endnotes () Your server is probably using UTF8 as the character set, which is important because Oracle can translate any other character set to UTF8 and back losslessly. But your client character set is probably whatever your client systems can display most easily. (*) Additional cautions on Oracle client character sets: the NLS_LANG environment variable is the only* way to set it, and there can only be one character set per client process. I have to start completely separate apache servers in order to handle multiple client character sets.	[reply]
Re: Re: UTF-8, Oracle and Perl life by Anonymous Monk on Oct 08, 2002 at 13:46 UTC
Having taken some of the suggestions above it looks like we are working fine. But it is true that we use Shift-JIS for web input/output through our Java Web Service application. The Oracle Database is in UTF-8 however. I have never even thought about the scenario you just mentioned since everything is working so well. I am used to the Java world where this difference is transparent to us. I greatly appreciate your pointing this out and will send this information to our DB Admins back in America. Thank you, Akira	[reply]
Re: UTF-8, Oracle and Perl life by wardk (Deacon) on Oct 07, 2002 at 19:33 UTC
Haven't worked with Japanese, but I have experienced Perl/Oraperl issues with Oracle related to character sets. My only advice would be to ensure that you have the following Oracle environment settings set in your Perl environment prior to connecting `ORA_NLS ORA_NLS33 NLS_LANG` [download] this can be done via the following (your values will obviously differ) `$ENV{ORA_NLS}="/opt/oracle/product/8.1.7/migrate/nls/admin/data"; $ENV{ORA_NLS33}="/opt/oracle/product/8.1.7/migrate/nls/admin/data"; $ENV{NLS_LANG}="AMERICAN_AMERICA.US7ASCII";` [download] Hope this is helpful...good luck!	[reply] [d/l] [select]
Re: Re: UTF-8, Oracle and Perl life by Akira71 (Scribe) on Oct 07, 2002 at 19:40 UTC
Actually the database in Oracle is setup and functioning correctly in UTF-8 through the ORA_NLS. The application we wrote using various web service technologies (except for anything .NET :-) ) works wonderfully. It is only the extract we have with problems with. In answer to the other reply above, yes we are using Oraperl. I was told we could not upgrade anything on the box originally. However, I did identify the possibility that Oraperl is to blame and have found the right Unix admin to add DBD:Oracle to the system so we can use it. Unfortunately the box is on the other part of the world for us, so this takes time for us to implement. Admitedly this is one of my only foray's into Perl and I love what I see so far. I am a C++ / Java person and now this has my interest totally. Thank you for support and helpful hints! Akira	[reply]
Re: Re: Re: UTF-8, Oracle and Perl life by wardk (Deacon) on Oct 07, 2002 at 19:53 UTC
Is your current extract script either setting the vars I mentioned or inheriting them from the calling environment? for instance, if the job is running from cron, you will absolutely need to set this in the perl script. This is independent from the database setup, and your perl won't inherit the settings from Oracle. They will need to be in the calling environment (set perhaps in a calling shell script) or setting them specifically in the perl.	[reply]
Re: Re: Re: Re: UTF-8, Oracle and Perl life by Akira71 (Scribe) on Oct 07, 2002 at 20:05 UTC
Re: Re: Re: Re: Re: UTF-8, Oracle and Perl life by rdfield (Priest) on Oct 08, 2002 at 08:40 UTC
Re: UTF-8, Oracle and Perl life by sth (Priest) on Oct 07, 2002 at 19:31 UTC
Well I've do not have experience using Japanese or UTF8, but you mention oraperl. I hope you are using DBI/DBD and not oraperl which is a very very old oracle interface for perl4. The current versions for DBI/DBD are 1.30/1.12, using these will give you the best chance for the UTF8 select to work. STH	[reply]
Re: UTF-8, Oracle and Perl life by John M. Dlugosz (Monsignor) on Oct 07, 2002 at 20:06 UTC
Perl's UTF-8 support breaks down only when it comes to char vs. byte orientation issues, and higher concepts like what a dot in a regex should match (more than one byte). It won't translate the string into another character set or otherwise munge it! Writing it out to a file will always write that sequence of bytes. Perl doesn't know about "formatting information". So I think the problem must be in the interface, not with Perl 5.6.0 itself (so going to 5.8 won't help).	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks