
If you have Cyrillic text in utf8 encoding, you will not be able to encode it into cp1252, which is a Latin1 code page. You should try encoding it into cp1251, which is the Microsoft single-byte encoding for Cyrillic.

None of the Cyrillic code points in Unicode (U+0400 - U+052F, the Cyrillic and Cyrillic Supplement blocks) can be mapped/converted to cp1252 -- you'll just get a bunch of errors or "?" characters. For that matter, your unicode/utf8 input data might contain other stuff that is neither ASCII nor Cyrillic (or at least does not exist in cp1251), in which case you would still be making a mess when you try to convert to cp1251.
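For example, here's a minimal sketch using the core Encode module (the sample string is just an illustration):

    use strict;
    use warnings;
    use Encode qw(encode FB_CROAK);

    # a decoded perl character string containing Cyrillic text ("privet")
    my $str = "\x{041F}\x{0440}\x{0438}\x{0432}\x{0435}\x{0442}";

    # cp1252 has no Cyrillic letters: the default CHECK substitutes "?"
    # for every unmappable character, and FB_CROAK dies instead
    my $mangled = encode('cp1252', $str);             # "??????"
    # my $boom  = encode('cp1252', $str, FB_CROAK);   # fatal error

    # cp1251 covers the whole Cyrillic block, so this converts cleanly
    my $cp1251 = encode('cp1251', $str);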

The basic problem is: utf8 can store a lot of different characters besides ASCII and Cyrillic (or ASCII and Latin, or ASCII and Greek, or ...), and people are now getting used to the idea of creating text data that has more than just the 200 (give or take) displayable characters available in any single 8-bit code page (like cp125-whatever, or iso-8859-whatever).

It might be worth your while to probe your data to see what it really contains -- try one or both of these tools: tlu -- TransLiterate Unicode, or unichist -- count/summarize characters in data. Maybe it will suffice to "hand-select" some appropriate "replacement" characters for the code points that are not available in cp1251.
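If you don't have those handy, a quick-and-dirty inventory is easy enough in plain perl -- a sketch, assuming the input is a utf8 text file (the file name is made up):

    use strict;
    use warnings;
    use Encode qw(encode FB_CROAK);

    my %count;
    open my $fh, '<:encoding(UTF-8)', 'data.txt' or die $!;
    while (my $line = <$fh>) {
        $count{$_}++ for split //, $line;
    }
    close $fh;

    # list each distinct character, flagging the ones cp1251 can't hold
    for my $ch (sort keys %count) {
        my $copy = $ch;   # encode() with a CHECK may clobber its input
        my $ok = eval { encode('cp1251', $copy, FB_CROAK); 1 };
        printf "U+%04X %8d  %s\n", ord($ch), $count{$ch},
               $ok ? '' : 'NOT in cp1251';
    }

The "hand-selected replacement" idea can then be done by passing a coderef as the CHECK argument to encode() -- given a character string in $text; the mapping here is hypothetical, just to show the mechanism:

    my %fallback = ( 0x00E9 => 'e', 0x2212 => '-' );   # e-acute, minus sign
    my $bytes = encode('cp1251', $text,
                       sub { exists $fallback{$_[0]} ? $fallback{$_[0]} : '?' });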

If you are using a database like mysql or whatever, you could probably just store the utf8 character string as a raw byte stream, and just not do anything with the encoding -- treat the data as raw binary stuff for insertion and selection, and only worry about encoding at the display end.

If you need to be able to query the database using (sub)string matches on a Cyrillic field, you should still be able to do that, so long as you treat the search string the same way you treated the data when you inserted it -- as a string of raw binary bytes. (How the user provides the string and sees the results is a separate thing, unrelated to how the database handles it.)
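For instance, a sketch with DBI and placeholders (the table and column names are invented) -- the search string gets encoded down to the same raw utf8 bytes that were inserted:

    use strict;
    use warnings;
    use DBI;
    use Encode qw(encode);

    my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass',
                           { RaiseError => 1 });

    # $input is a decoded character string (say, from a web form);
    # turn it back into the same raw utf8 bytes that went into the table
    my $input  = "\x{0418}\x{0432}\x{0430}\x{043D}";   # "Ivan" in Cyrillic
    my $needle = encode('utf8', $input);

    my $rows = $dbh->selectall_arrayref(
        'SELECT name FROM people WHERE name LIKE ?',
        undef, "%$needle%");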

I think the only time you would need to worry about getting the database to do the correct interpretation of the character encoding is if you need to sort / collate strings in a language-relevant manner (that is, if you have to worry about the Cyrillic equivalent of "alphabetic" vs. "ascii-betic" sorting). In that case, I'm hoping you were mistaken when you said you are using a cp1252-based system, because you might have trouble doing Cyrillic-based stuff on a Latin1 system.
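(If it turns out you do need language-aware ordering, one way around a Latin1-centric database is to sort on the perl side with the core Unicode::Collate module -- a sketch, assuming the strings have already been decoded to perl character strings:)

    use strict;
    use warnings;
    use utf8;               # the source code contains literal Cyrillic
    use Unicode::Collate;

    my @words    = qw(яблоко арбуз ёж);
    my $collator = Unicode::Collate->new;

    # UCA ordering gives a sensible "alphabetic" result; a plain sort()
    # would order by raw code point ("ascii-betic") instead
    my @sorted = $collator->sort(@words);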

Re^2: Character encoding fun...
by joem (Initiate) on Nov 16, 2007 at 15:15 UTC
    Hello graff,
    This is most insightful. I have a file which contains UTF-8 characters for several languages. It is a translation table for our app. Within this file there is English, Spanish, French, Russian, and Chinese. When I view the file with a UTF-8-capable editor, I see all the characters as they should appear.

    The problem arises when this file's data is read by perl and pushed into a database (Oracle) varchar2 field. The character set for the database is actually WE8ISO8859P15. I just need to store the UTF-8 characters in these fields without having to manipulate them.

    The concept of storing the data as a raw byte stream is what I wanted to accomplish, though I'm almost certain that using the Encode functions is not getting me there. How can I take the UTF-8 data and convert it from a "string" to a byte stream in perl?

    Thanks,
    Joe
      The concept of storing the data as a raw byte stream is what I wanted to accomplish, though I'm almost certain that using the Encode functions is not getting me there.

      I think you should not have to use the Encode functions at all in order to put the data into the database. I could be wrong, but if you just put the variable(s) containing the utf8 string(s) as the arg(s) you pass to the  $sth->execute() call (you are using placeholders, aren't you?), it should do the right thing -- Oracle won't know anything about perl's internal utf8 flag, and doesn't need to know. The string(s) should just go into the table column(s) without further ado.
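      Something along these lines -- a sketch, with invented table / column / file names, reading the file in raw mode so the lines stay plain byte strings:

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                                 { RaiseError => 1 });

          # no :utf8 layer on the filehandle -- each line is raw utf8 bytes
          open my $fh, '<:raw', 'translations.txt' or die $!;
          my $sth = $dbh->prepare('INSERT INTO translations (msg) VALUES (?)');
          while (my $line = <$fh>) {
              chomp $line;
              $sth->execute($line);   # the bytes go in untouched
          }
          close $fh;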

      (The only issue where I might be wrong about that is if your Oracle setup happens to behave strangely when given byte values in the range 0x80-0x9f; a lot of the utf8 "continuation" (non-initial) bytes are likely to fall in this range, and under some interpretations of "ISO-8859" those values are either given some sort of special treatment (e.g. "interpreted" as control characters with strange side effects) or are not supposed to exist at all. But I don't think a varchar2 field in Oracle is going to be finicky in this way.)

      When you query to get data back from the database, you'll need to do something like  $utf8_str = decode( "utf8", $db_string ) to tell perl that the string is utf8 data.
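      Continuing the same sketch, the read side might look like this:

          use Encode qw(decode);

          my $query = $dbh->prepare('SELECT msg FROM translations');
          $query->execute;
          while (my ($db_string) = $query->fetchrow_array) {
              my $utf8_str = decode('utf8', $db_string);
              # $utf8_str is now a proper perl character string, ready
              # to be re-encoded for whatever the display end wants
          }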