Hello graff,
This is most insightful. I have a file which contains UTF-8 characters for several languages. It is a translation table for our app. Within this file there is English, Spanish, French, Russian, and Chinese. When I view the file with a UTF-8-capable editor I see all the characters as they should appear.

The problem arises when this file's data is read by perl and pushed into a database (Oracle) varchar2 field. The encoding for the database is actually WE8ISO8859P15. I just need to store the UTF-8 characters in these fields without having to manipulate them.

The concept of storing the data as a raw byte stream is what I wanted to accomplish though I'm almost certain that using the Encode functions is not getting me there. How can I take the UTF data and convert it from a "string" to a byte stream in perl?

Thanks,
Joe

Re^3: Character encoding fun...
by graff (Chancellor) on Nov 16, 2007 at 23:44 UTC
    The concept of storing the data as a raw byte stream is what I wanted to accomplish though I'm almost certain that using the Encode functions is not getting me there.

    I think you should not have to use the Encode functions at all in order to put the data into the database. I could be wrong, but if you just pass the variable(s) containing the utf8 string(s) as the arg(s) to the $sth->execute() call (you are using placeholders, aren't you?), it should do the right thing -- oracle won't know anything about perl's internal utf8 flag, and doesn't need to know. The string(s) should just go into the table column(s) without further ado.

    (The only issue where I might be wrong about that is if your oracle setup happens to behave strangely when given bytes in the range 0x80-0x9f. utf8 "continuation" (non-initial) bytes run from 0x80 to 0xbf, so a lot of them are likely to be in this range, and for some interpretations of "ISO-8859", such bytes are either given some sort of special treatment (e.g. "interpreted" as control characters with strange side effects), or else they are not supposed to exist. But I don't think a varchar2 field in oracle is going to be finicky in this way.)
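
To make the byte-range concern concrete, here is a minimal sketch that prints the utf8 bytes of a single Cyrillic letter and marks any that land in 0x80-0x9f. The choice of U+0416 is only an illustrative assumption, not something taken from the translation file:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0416 (CYRILLIC CAPITAL LETTER ZHE), chosen only as an example;
# its UTF-8 encoding is the two bytes 0xD0 0x96.
my $bytes = encode("UTF-8", "\x{416}");

# Print each byte and flag the ones in the range discussed above.
for my $byte (unpack "C*", $bytes) {
    my $note = ($byte >= 0x80 && $byte <= 0x9f) ? "  <- in 0x80-0x9f" : "";
    printf "0x%02X%s\n", $byte, $note;
}
```

The second byte (0x96) is a continuation byte that falls inside the range, which is the kind of value that could be affected if the database treated that range specially.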

    When you query to get data back from the database, you'll need to do something like $utf8_str = decode( "utf8", $db_string ) to tell perl that the bytes in the string are utf8 data, so that it treats them as characters again.
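
As a rough sketch of the string-versus-bytes distinction, without any database involved, the core Encode module can turn a character string into raw utf8 bytes and back. The sample text and variable names are made up for illustration, and whether an explicit encode() is needed before the insert depends on how the file was read and how the driver is configured:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string as perl holds it once the input has been decoded.
# "\x{e9}" is e-acute and "\x{4e2d}" is a Chinese character (sample data).
my $chars = "caf\x{e9} \x{4e2d}";

# Character string -> raw UTF-8 byte string (what would sit in the column).
my $bytes = encode("UTF-8", $chars);

# Raw bytes as fetched back from the database -> character string again.
my $round_trip = decode("UTF-8", $bytes);

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

The two lengths differ because length() counts characters in the decoded string and bytes in the encoded one.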