Akira71 has asked for the wisdom of the Perl Monks concerning the following question:

OK fellow monks,

This is getting serious. As you know, I posted a question here a few days ago about using Perl 5.6, Oracle 8i and Japanese scripts. I got some excellent suggestions and followed them through, and now I am having what seems to be an even more serious issue. I imagine this comes down to basic encoding problems and might be a little off-topic for PerlMonks, but everyone here has been extremely helpful and knowledgeable, so here it goes.

It started out with an app written in Java and run from a browser. It connects to an Oracle8i database, which is set for UTF8. The web browsers (tested in Netscape, Opera and IE) are all set to Shift-JIS, as is the encoding in the HTML and JSP pages. I am assuming we are writing raw Shift-JIS data straight into the UTF-8 database. When we use the browser-based application to read from that database and output to the screen, we get the correct Japanese everywhere.

Here is the real problem. When we do the same with Perl 5.6 we get garbage everywhere. I was able to use the suggestions from the monks in my last thread to set all my environment variables correctly and ensure everything was working. As a matter of fact, if I put the output of the Perl program into an ANSI text file and add HTML HEAD and BODY headers, then the output is nearly perfect Japanese (I say nearly, as some characters are dropped. That is another issue.)

However, even though I can output raw UTF-8, I cannot make sense of any of the data unless it is output to a browser. I am using TOAD and various other tools to inspect the data in the database and, as suspected (even on a Japanese O/S), the data appears to be garbage ASCII characters. This is fine. The output from Perl is the exact same set of garbage characters, but of course with the bytes intact. What I need to know is how we can make this file viewable on a Japanese O/S without resorting to a browser.
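
One way to check which encoding those bytes really are is to look at them in hex rather than through any viewer. A minimal sketch, using the word 日本語 as an assumed sample and a made-up helper named hex_dump; real rows fetched from the database would take the place of the hard-coded strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The same three characters as raw bytes in two encodings
# (sample data, not taken from the real database).
my $as_utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
my $as_sjis = "\x93\xFA\x96\x7B\x8C\xEA";

print "UTF-8:     ", hex_dump($as_utf8), "\n";
print "Shift-JIS: ", hex_dump($as_sjis), "\n";
```

If the dump of a known word matches the second pattern, the stored bytes are Shift-JIS that was never converted; if it matches the first, they are genuine UTF-8.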

The browser does correctly render the output and reports it as Shift-JIS. I have tried a few Perl modules to output UTF8 (raw works, in the browser only), ShiftJIS (never works in anything), JIS and other encodings, and it always fails. I have not been able to produce a single readable text document from Perl. Only raw UTF-8 output, auto-detected as Shift-JIS and viewed in a browser, works. I am doing this from a terminal connected to a Solaris box, and I know the output it is giving me from the database is correct.
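
For reference, the kind of conversion being attempted can be sketched with the Encode module. This assumes Perl 5.8 or later (Encode is not bundled with 5.6) and assumes the bytes fetched from the database really are valid UTF-8; if they are Shift-JIS bytes that were stored unconverted, this would not be the right conversion. The helper name utf8_to_sjis, the sample row and the file name report_sjis.txt are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert a UTF-8 byte string to Shift-JIS bytes.
# from_to() converts its first argument in place, so we work on a copy.
# Returns undef if the conversion fails.
sub utf8_to_sjis {
    my ($bytes) = @_;
    my $len = from_to($bytes, 'utf8', 'shiftjis');
    return defined $len ? $bytes : undef;
}

# Sample field standing in for a value fetched from the database.
my $field = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
my $sjis  = utf8_to_sjis($field);
die "Conversion failed\n" unless defined $sjis;

# Write a tab-delimited line; binmode stops any newline translation
# from touching the bytes on the way out.
open(my $out, '>', 'report_sjis.txt') or die "Cannot open report_sjis.txt: $!";
binmode($out);
print {$out} $sjis, "\t", "ABC", "\n";
close($out) or die "Cannot close report_sjis.txt: $!";
```

A file written this way should open as Japanese in an editor that expects Shift-JIS, provided the input assumption above holds.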

Does anyone with experience in Japanese apps, Perl and Unix have any further ideas as to why I cannot output a delimited text file in Japanese? This is more than an issue of having the correct fonts installed on Unix. I only need to view the output in a text document on my Japanese PC here.

In Japan I have only used Java, C++ and JPerl, and only natively. I have never mixed them with a US server and web apps with Perl reporting backends in several language formats.

I would very much appreciate any leads on this. I simply hate to be lost on a topic to this extent.

Akira

P.S. Please forgive my less than adequate English or any misspellings. This takes much time for me to write and format correctly.

  • Comment on UTF8, Perl 5.6 and Oracle Revisited....

Re: UTF8, Perl 5.6 and Oracle Revisited....
by rdfield (Priest) on Oct 10, 2002 at 13:25 UTC
    I'm not completely sure what the actual question is, but I'll try and explain why I think your browser is seeing garbage when using Perl: if you are using Apache/mod_perl (or even Apache without mod_perl, I suppose) to talk to Oracle then you'll need to set NLS_LANG in your httpd.conf.
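
    If it turns out that Apache is not involved, the same variable can be set from inside the script instead, as long as that happens before the Oracle client library is loaded. A rough sketch; the value JAPANESE_JAPAN.UTF8 is an assumption and has to match what you actually want the client to receive:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Set NLS_LANG in a BEGIN block so that it is already in place
    # before DBD::Oracle (and the Oracle client under it) is loaded.
    BEGIN {
        # Assumed value, in the form language_territory.charset.
        $ENV{NLS_LANG} = 'JAPANESE_JAPAN.UTF8';
    }

    # A real script would load DBI and connect to the database here.
    print "NLS_LANG is $ENV{NLS_LANG}\n";
    ```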

    rdfield

      Hmm, that is something to think about. We are using Oracle running on a Sun Solaris server. The Perl implementation on it is 5.6, but I do not think it is Apache mod_perl. However, any hint that might help me solve this issue is welcome. I will look into it. I am also thinking that the ISO-2022-JP code could be damaged, as can happen when it comes from a browser, so we are applying the text-cleaning techniques from Ken Lunde's excellent book.

      The question is not clear. I apologize. I am trying to find out how to get a good output file from the database that will come up in Japanese if I open it in Japanese Word or some Japanese text editor. Only the browser is consistent with this. I did apply all the help from my previous question, and that allowed me to get the raw UTF-8 data from the database, but it is not viewable as Kanji characters unless I put it into a web browser. The data is not in the HTML Unicode &# format, though, so I know this is not the issue.

      I hope that clears up some. And thank you very much for your response. :-)

        I've noticed problems viewing UTF-8 data in Word97, but the same data looked OK in Word2000. Generally speaking, all the language/UTF-8 work I've been involved with was via XML, and the browser was the only way we could reliably view the data. Word2000 uses some of the same libraries as IE (another Monk may have more information/knowledge in this area...?) and is much better than previous versions at displaying XML/UTF data.

        rdfield

Re: UTF8, Perl 5.6 and Oracle Revisited....
by l2kashe (Deacon) on Oct 11, 2002 at 04:17 UTC
    A thought tickled my brain, and I thought I would share, even though it might not help..
    I was dealing with compressing and decompressing data via Compress::Zlib, and doing some research. Apparently browsers are smart enough to take a compressed data stream from a server and decompress it locally! So, with that being said, it would stand to reason that you get garbage everywhere except in a browser, if the browser were in fact decompressing prior to output. Now, that was in regard to the total HTTP transmission, but I wouldn't put it past them to auto-detect compressed data and simply decompress those components of the transmission stream.
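
    If you want to rule this out, gzip data always starts with the two bytes 0x1f 0x8b, so a quick check is possible. A minimal sketch; looks_gzipped is a made-up helper name and the sample strings stand in for a value fetched from the database:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: true if the byte string starts with the
    # gzip magic number (0x1f 0x8b).
    sub looks_gzipped {
        my ($bytes) = @_;
        return 0 unless defined $bytes && length($bytes) >= 2;
        return substr($bytes, 0, 2) eq "\x1f\x8b" ? 1 : 0;
    }

    # Sample inputs: a gzip header fragment and some plain UTF-8 bytes.
    my @samples = ("\x1f\x8b\x08\x00", "\xE6\x97\xA5\xE6\x9C\xAC");
    for my $sample (@samples) {
        print looks_gzipped($sample) ? "gzip\n" : "not gzip\n";
    }
    ```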

    Might help, might be worthless, happy hacking

    /* And the Creator, against his better judgement, wrote man.c */
Re: UTF8, Perl 5.6 and Oracle Revisited....
by samgold (Scribe) on Oct 11, 2002 at 05:19 UTC
    Your best bet is to read the Oracle docs on using different character sets. I have never administered a database that did not use a US-based character set. My best guess is that the data in the database is encoded, and you need to decode it when you query it. I hope this helps.