Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

UTF-8, Oracle and Perl life

by Akira71 (Scribe)
on Oct 07, 2002 at 16:48 UTC ( [id://203392]=perlquestion: print w/replies, xml ) Need Help??

Akira71 has asked for the wisdom of the Perl Monks concerning the following question:

First I am new to Perl and I have searched everywhere for an answer to this question without much luck, but I thought I would try you enlightened people.

I have a new application done entirely in Java(groan) that works in Japanese for a major corporation. It turns out though, that the reporting structure for the company (extracts, formatting, etc..) is done entirely in Perl 5.6.0. This is what Perl is excellent at doing so it should be fine. Well the problem boils down to legacy system and I am not sure what to do about it. I have Perl 5.6.0, Oraperl module to connect to Perl and JCODE to deal with the Japanese encodings.

All should be well, but basically what I am finding out is that Perl, even with Use UTF8 does not use UTF8 well. We are extracting garbage out of the database that is consistent with the ASCII representation of UTF8 / ShiftJIS data. In otherwords, if you do not have the font, then this how that kanji character will look. Oracle 8i is set to run in UTF8 and it is preserving the encoding correctly and Java seems to deal with the data correctly (our application runs in Japanese) but Perl does nothing with this information. First if we write it out to a text file, all encoding is lost. I know there is a way to ensure that UTF8 is used for output in Perl 5.8 but we cannot upgrade. Second it actually appears that Oraperl is dropping the encoding information whenever we retrieve data. I am fairly certain of this because when we call JCODE to convert the character sets, the input is there, but then it does not tranform the characters and it ends up having an empty string.

What I am asking, is if anyone has experience using UTF8, Oracle 8i, Perl 5.6, Oraperl and in Japanese can shed some enlightenment my way. I am having a heck of a time.
Thank you very much in advance of any information,

Akira Yamashita

Replies are listed 'Best First'.
Re: UTF-8, Oracle and Perl life
by sharkey (Scribe) on Oct 07, 2002 at 21:54 UTC
    Your database may be using UTF8, but if your clients are anything like ours (windows), they are probably connecting with the client charater set set to:

    NLS_LANG=JAPANESE_JAPAN.JA16SJIS

    Which means the Oracle client is translating the characters to UTF8 when it sends them to the database and back to SJIS when it retrieves them. Windows still doesn't have a decent UTF8 IME for Japanese, at least according to our Japanese users.

    For my Japanese web interface, I connect to the database using the same NLS_LANG setting, and treat the data as binary data. Oracle gives me SJIS data, perl doesn't change it, and the clients get the SJIS characters they are expecting.

    If you really are using UTF8 on the client, you should be able to connect with your client character set as NLS_LANG=(whatever).UTF8 and get UTF8 from the database. But I would make darn sure that your clients really doing that. (Everyone here said we were using UTF8(*), but when I checked the actual windows machines, they were using JA16SJIS, WE8ISO8859P1, and other local character sets(**) on the clients.)

    Endnotes

    (*) Your server is probably using UTF8 as the character set, which is important because Oracle can translate any other character set to UTF8 and back losslessly. But your client character set is probably whatever your client systems can display most easily.

    (**) Additional cautions on Oracle client character sets: the NLS_LANG environment variable is the only way to set it, and there can only be one character set per client process. I have to start completely separate apache servers in order to handle multiple client character sets.

      Having taken some of the suggestions above it looks like we are working fine. But it is true that we use Shift-JIS for web input/output through our Java Web Service application. The Oracle Database is in UTF-8 however. I have never even thought about the scenario you just mentioned since everything is working so well. I am used to the Java world where this difference is transparent to us. I greatly appreciate your pointing this out and will send this information to our DB Admins back in America.

      Thank you, Akira

Re: UTF-8, Oracle and Perl life
by wardk (Deacon) on Oct 07, 2002 at 19:33 UTC
    Haven't worked with Japanese, but I have experienced Perl/Oraperl issues with Oracle related to character sets.

    My only advice would be to ensure that you have the following Oracle environment settings set in your Perl environment prior to connecting

    ORA_NLS ORA_NLS33 NLS_LANG
    this can be done via the following (your values will obviously differ)
    $ENV{ORA_NLS}="/opt/oracle/product/8.1.7/migrate/nls/admin/data"; $ENV{ORA_NLS33}="/opt/oracle/product/8.1.7/migrate/nls/admin/data"; $ENV{NLS_LANG}="AMERICAN_AMERICA.US7ASCII";
    Hope this is helpful...good luck!
      Actually the database in Oracle is setup and functioning correctly in UTF-8 through the ORA_NLS. The application we wrote using various web service technologies (except for anything .NET :-) ) works wonderfully. It is only the extract we have with problems with.

      In answer to the other reply above, yes we are using Oraperl. I was told we could not upgrade anything on the box originally. However, I did identify the possibility that Oraperl is to blame and have found the right Unix admin to add DBD:Oracle to the system so we can use it. Unfortunately the box is on the other part of the world for us, so this takes time for us to implement. Admitedly this is one of my only foray's into Perl and I love what I see so far. I am a C++ / Java person and now this has my interest totally.

      Thank you for support and helpful hints!

      Akira

        Is your current extract script either setting the vars I mentioned or inheriting them from the calling environment? for instance, if the job is running from cron, you will absolutely need to set this in the perl script. This is independent from the database setup, and your perl won't inherit the settings from Oracle. They will need to be in the calling environment (set perhaps in a calling shell script) or setting them specifically in the perl.
Re: UTF-8, Oracle and Perl life
by sth (Priest) on Oct 07, 2002 at 19:31 UTC
    Well I've do not have experience using Japanese or UTF8, but you mention oraperl. I hope you are using DBI/DBD and not oraperl which is a very very old oracle interface for perl4. The current versions for DBI/DBD are 1.30/1.12, using these will give you the best chance for the UTF8 select to work. STH
Re: UTF-8, Oracle and Perl life
by John M. Dlugosz (Monsignor) on Oct 07, 2002 at 20:06 UTC
    Perl's UTF-8 support breaks down only when it comes to char vs. byte orientation issues, and higher concepts like what a dot in a regex should match (more than one byte). It won't translate the string into another character set or otherwise munge it! Writing it out to a file will always write that sequence of bytes.

    Perl doesn't know about "formatting information". So I think the problem must be in the interface, not with Perl 5.6.0 itself (so going to 5.8 won't help).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://203392]
Approved by enoch
Front-paged by krisahoch
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2024-04-25 05:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found