
So the way is starting to seem a little clearer.
To take the following path

UTF_String -> CP1252 data -> UTF_String
Source      -> Storage       -> Display

I will need to
encode('cp1252', $my_utf_data) -> store -> decode('utf8', $retrieved_data)
though that does not provide any results per se.

I must admit I am very new to character encoding. It would be nice to be able to just manipulate the strings as byte arrays for storage and retrieval, though I'm not sure how to accomplish that in Perl :(

Thanks,
Joe

Re^5: Character encoding fun...
by graff (Chancellor) on Nov 16, 2007 at 02:54 UTC
    As indicated in my note below (which was a reply to the anonymous OP), you have to stop talking about cp1252 (Latin1), and switch to cp1251 (Cyrillic) instead -- that alone might account for some of the problems you are having.

    Apart from that, if the data originally comes from a file (or other external source) as utf8 text, your Perl script first has to be made aware that it is utf8 data, either via open($fh,"<:utf8",$file), or via $utf8_string=decode('utf8',$input_string).
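
    A minimal sketch of those two options, assuming the input arrives as raw utf8 bytes. The sample bytes and the in-memory filehandle (standing in for a real file) are illustrative assumptions, not part of the original problem:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample input: the raw utf8 bytes of a six-letter Cyrillic word.
my $input_bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Option 1: decode the bytes explicitly into a Perl character string.
my $from_decode = decode('utf8', $input_bytes);

# Option 2: let the I/O layer decode while reading. An in-memory
# filehandle stands in for a real file here.
my $file_bytes = $input_bytes . "\n";
open(my $fh, '<:utf8', \$file_bytes) or die "open failed: $!";
my $from_file = <$fh>;
close($fh);
chomp $from_file;

printf "bytes: %d, characters: %d, same result: %s\n",
    length($input_bytes), length($from_decode),
    ($from_decode eq $from_file ? 'yes' : 'no');
```

    Either way, the point is that length() should count characters rather than bytes once Perl knows the data is utf8.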

    Then you encode( "cp1251", $utf8_string ) and use the resulting string as the input to your non-unicode database. On getting stuff back from the database, do $utf8_string=decode("cp1251", $db_string) to get back to your original utf8 Cyrillic string.
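
    The round trip might look roughly like this. The database step is left out, and $db_string simply stands in for whatever would be stored and later retrieved (the variable names and sample text are assumptions for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: a Cyrillic word as a Perl character string
# (i.e. already decoded from utf8).
my $utf8_string = "\x{041F}\x{0440}\x{0438}\x{0432}\x{0435}\x{0442}";

# Characters -> cp1251 bytes; this is what would go into the database.
my $db_string = encode('cp1251', $utf8_string);

# cp1251 bytes coming back from the database -> characters again.
my $restored = decode('cp1251', $db_string);

printf "stored %d bytes for %d characters\n",
    length($db_string), length($utf8_string);
print $restored eq $utf8_string ? "round trip ok\n" : "round trip lost data\n";
```

    Note that the decode on the way back names cp1251, the encoding the bytes are actually in, not utf8.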

    But if the original utf8 Cyrillic string included any character(s) that do not exist in the cp1251 character set, those things will not survive the conversion into cp1251, period.

    In that case, you'll need to replace the "unmappable" characters in question with suitable substitutes, if possible, and that will probably involve some manual inspection and decisions about what sort of replacement(s) would be suitable...
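
    To find out which characters need attention before deciding on replacements, something along these lines could help. The helper name unmappable_chars and the sample text are made up for illustration; the sketch relies on Encode's FB_CROAK check, which makes encode() die on a character the target encoding lacks instead of silently substituting for it:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the distinct characters of $text that
# cannot be represented in $encoding, sorted by code point.
sub unmappable_chars {
    my ($text, $encoding) = @_;
    my %bad;
    for my $char (split //, $text) {
        next if $bad{$char};
        # LEAVE_SRC keeps encode() from modifying its input when a check is set.
        my $ok = eval {
            Encode::encode($encoding, $char,
                Encode::FB_CROAK() | Encode::LEAVE_SRC());
            1;
        };
        $bad{$char} = 1 unless $ok;
    }
    return sort keys %bad;
}

# Sample text: Cyrillic letters plus two characters outside cp1251.
my $text = "\x{041F}\x{0440}\x{0438}\x{0432}\x{0435}\x{0442} caf\x{E9} \x{263A}";
my @bad  = unmappable_chars($text, 'cp1251');
print join(' ', map { sprintf 'U+%04X', ord } @bad), "\n";
```

    The resulting list is only a starting point; picking a suitable substitute for each character is still a manual decision.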

    (updated in hopes of making things clearer)