Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I am running into a problem with text encoding. I have data encoded as UTF-8, and it happens to be made up of Cyrillic characters. This data must reside in a database whose character encoding is cp1252 (or windows-1252).

I have used Encode like this:

    decode('cp1252', encode('utf8', $my_utf_data));

to prepare the data for insertion into the database, then the reverse when reading it out.

The problem is that a few characters got 'garbled': the Cyrillic characters ya and es are now shown as a w with a squiggly over them. Is there a different way of taking the utf8 bytes and storing them as cp1252 bytes so that they may later be read as cp1252 bytes and converted back to utf8 for display?
Thanks,

your help is much appreciated,

Joe

Replies are listed 'Best First'.
Re: Character encoding fun...
by graff (Chancellor) on Nov 16, 2007 at 02:28 UTC
    If you have Cyrillic text in utf8 encoding, you will not be able to encode it into cp1252, which is a Latin1 code page. You should try encoding it into cp1251, which is the Microsoft single-byte encoding for Cyrillic.

    None of the "wide" Cyrillic code points in Unicode (U+0400 - U+52F) can be mapped/converted to cp1252 -- you'll just get a bunch of errors or "?" characters. For that matter, your unicode/utf8 input data might contain other stuff that is neither ASCII nor Cyrillic (or at least does not exist in cp1251), in which case, you would still be making a mess when you try to convert to cp1251.

    The basic problem is: utf8 can store a lot of different characters besides ascii and Cyrillic (or ascii and Latin, or ascii and Greek, or ...), and now people are getting used to the idea of creating text data that has more than just the 200 (give-or-take) displayable characters that are available in any chosen 8-bit code page (like cp125whatever, or iso-8859-whatever).

    It might be worth your while to probe your data to see what it really contains -- try one or both of these tools to see what you have: tlu -- TransLiterate Unicode, unichist -- count/summarize characters in data. Maybe it will suffice to "hand select" some appropriate "replacement" characters for some of the code points not available in cp1251.
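
    If those tools aren't handy, a rough stand-in is a short script that histograms the non-ASCII code points in a file (the file name here is hypothetical):

        use strict;
        use warnings;

        # Tally every non-ASCII code point in a UTF-8 file.
        my %count;
        open my $fh, '<:encoding(UTF-8)', 'data.txt' or die $!;
        while (my $line = <$fh>) {
            $count{$_}++ for grep { ord($_) > 127 } split //, $line;
        }
        printf "U+%04X %5d\n", ord($_), $count{$_}
            for sort { ord($a) <=> ord($b) } keys %count;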

    If you are using a database like mysql or whatever, you could probably just store the utf8 character string as a raw byte stream, and just not do anything with the encoding -- treat the data as raw binary stuff for insertion and selection, and only worry about encoding at the display end.
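
    A hedged sketch of that approach with DBI (the DSN, table, and column are made up; the point is that the value passed through the placeholder is a plain byte string, with no Encode calls involved):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass',
                               { RaiseError => 1 });

        # Raw UTF-8 octets for Cyrillic "da"; in practice these would come
        # from a filehandle opened *without* an :encoding layer.
        my $utf8_bytes = "\xD0\xB4\xD0\xB0";

        $dbh->do('INSERT INTO phrases (txt) VALUES (?)', undef, $utf8_bytes);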

    If you need to be able to query the database using (sub)string matches on a Cyrillic field, you should still be able to do that, so long as you treat the search string the same way you treated the data when you inserted it -- as a string of raw binary bytes. (How the user provides the string and sees the results is a separate thing, unrelated to how the database handles it.)

    I think the only time you would need to worry about getting the database to do the correct interpretation of the character encoding is if you need to sort / collate strings in a language-relevant manner (that is, if you have to worry about the Cyrillic equivalent of "alphabetic" vs. "ascii-betic" sorting). In that case, I'm hoping you were mistaken when you said you are using a cp1252-based system, because you might have trouble doing Cyrillic-based stuff on a Latin1 system.

      Hello graff,
      This is most insightful. I have a file which contains UTF-8 characters for several languages. It is a translation table for our app. Within this file there is English, Spanish, French, Russian, and Chinese. When I view the file with a UTF-capable editor I see all the characters as they should appear.

      The problem arises when this file's data is read by perl and pushed into a database (Oracle) varchar2 field. The encoding for the database is actually WE8ISO8859P15. I just need to store the UTF characters in these fields without having to manipulate them.

      The concept of storing the data as a raw byte stream is what I wanted to accomplish though I'm almost certain that using the Encode functions is not getting me there. How can I take the UTF data and convert it from a "string" to a byte stream in perl?

      Thanks,
      Joe
        The concept of storing the data as a raw byte stream is what I wanted to accomplish though I'm almost certain that using the Encode functions is not getting me there.

        I think you should not have to use the Encode functions at all in order to put the data into the database. I could be wrong, but if you just put the variable(s) containing the utf8 string(s) as the arg(s) you pass to the sth->execute() call (you are using placeholders, aren't you?), it should do the right thing -- oracle won't know anything about perl's internal utf8 flag, and doesn't need to know. The string(s) should just go into the table column(s) without further ado.

        (The only issue where I might be wrong about that is if your oracle setup happens to behave strangely when given characters in the range 0x80-0x9f; a lot of the utf8 "continuation" (non-initial) bytes are likely to be in this range, and for some interpretations of "ISO-8859", they are either given some sort of special treatment (e.g. "interpreted" as control characters with strange side effects), or else they are not supposed to exist. But I don't think a varchar2 field in oracle is going to be finicky in this way.)

        When you query to get data back from the database, you'll need to do something like  $utf8_str = decode( "utf8", $db_string ) to tell perl that the string is utf8 data.
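
        For instance (the byte string below stands in for whatever the database hands back):

            use strict;
            use warnings;
            use Encode qw(decode);

            # Pretend these octets just came back from the varchar2 column.
            my $db_string = "\xD0\xB4\xD0\xB0";           # UTF-8 bytes for Cyrillic "da"
            my $utf8_str  = decode('utf8', $db_string);   # now a character string

            printf "U+%04X U+%04X\n", map { ord } split //, $utf8_str;   # U+0434 U+0430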

Re: Character encoding fun...
by joem (Initiate) on Nov 15, 2007 at 19:59 UTC
    Sorry, forgot to log in, though this message was from me.
    Joe
Re: Character encoding fun...
by pc88mxer (Vicar) on Nov 15, 2007 at 20:26 UTC
    Are you sure that you are using decode('cp1252', encode('utf8', ...))? I'm not sure it ever makes sense to do that.

    I think you want:

    encode('cp1252', decode('utf8', $my_utf_data))
      Hello,
      Thanks for the quick response.
      I thought that's what I wanted, though when I do that I get:
      Cannot decode string with wide characters at C:/Perl588/lib/Encode.pm line 166.

      which is why it's turned around.

      Joe
        Your problem is that $my_utf_data contains code points (numbers representing Unicode characters), not octets (i.e. bytes).

        If $my_utf_data really contains bytes, no character in that string should be > 255. The error message you are getting indicates that there are characters > 255 in your string.

        If $my_utf_data is really text (i.e. consists of code points), then all you need is the call to encode to get a cp1252 encoded stream of bytes:

        encode('cp1252', $my_utf_data)
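
        A quick way to check which case you are in (the sample string is hypothetical):

            use strict;
            use warnings;

            # A decoded character string holding Cyrillic "da".
            my $str = "\x{0434}\x{0430}";

            if (grep { ord($_) > 255 } split //, $str) {
                print "text: has code points above 255, so encode() it directly\n";
            }
            else {
                print "all ords <= 255: this could already be a byte string\n";
            }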
Re: Character encoding fun...
by joem (Initiate) on Nov 19, 2007 at 21:32 UTC
    Hello all,

    Thanks to those who helped me shed some light on this subject. In the end the triumph went to keeping things "fairly" simple.

    To encode the UTF text for storage in the cp1252 database:

        . . .
        use Unicode::String;

        my $utf_string        = [some string which is UTF8];
        my $encased_string    = Unicode::String->new($utf_string);
        my $data_for_database = $encased_string->hex();

        # For better space savings...
        $data_for_database =~ s/U\+//g;
    For decoding the data back from the db:

        my $encoded_string = [string data from db saved from previous step];
        my $encased_string = Unicode::String->new();
        $encased_string->hex($encoded_string);
        my $utf8_string = Encode::decode_utf8($encased_string->utf8());
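
    A quick sanity check of that round trip (this assumes Unicode::String's hex() setter accepts the bare hex numbers once the U+ prefixes are stripped -- worth verifying against your installed version):

        use strict;
        use warnings;
        use Unicode::String;
        use Encode qw(encode_utf8 decode_utf8);

        my $original = "\x{0434}\x{0430}";   # Cyrillic "da" as a character string

        # Encode for storage, as above (new() takes UTF-8 octets by default).
        my $stored = Unicode::String->new(encode_utf8($original))->hex();
        $stored =~ s/U\+//g;

        # Decode on the way back out.
        my $u = Unicode::String->new();
        $u->hex($stored);
        my $round_trip = decode_utf8($u->utf8());

        print $round_trip eq $original ? "round trip ok\n" : "mismatch\n";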

    Thanks all for your help,

    Joe