Hello,
Thanks for the quick response.
I thought that's what I wanted, but when I do that I get:
Cannot decode string with wide characters at C:/Perl588/lib/Encode.pm line 166.

which is why it's turned around.

Joe

Re^3: Character encoding fun...
by pc88mxer (Vicar) on Nov 15, 2007 at 20:48 UTC
    Your problem is that $my_utf_data contains code points (numbers representing Unicode characters), not octets (i.e. bytes).

    If $my_utf_data really contains bytes, no character in that string should be > 255. The error message you are getting indicates that there are characters > 255 in your string.
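
    A quick way to check which kind of string you have is sketched below. The helper name has_wide_chars and the sample strings are made up for illustration; the exact wording of the error message can differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the string holds any character above 255,
# which means it cannot be a plain string of octets.
sub has_wide_chars {
    my ($str) = @_;
    return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $text  = "\x{41F}\x{440}\x{438}";   # three Cyrillic code points (already text)
my $bytes = "\xD0\x9F";                # two octets (the UTF-8 form of U+041F)

# Decoding something that is already text fails, because decode() expects octets.
my $decoded_ok = eval { decode('UTF-8', $text); 1 } ? 1 : 0;

printf "text has wide chars:  %d\n", has_wide_chars($text);
printf "bytes has wide chars: %d\n", has_wide_chars($bytes);
printf "decode of text worked: %d\n", $decoded_ok;
```

    If the check reports wide characters, the string is already decoded text and should go through encode, not decode.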

    If $my_utf_data is really text (i.e. consists of code points), then all you need is the call to encode to get a cp1252 encoded stream of bytes:

    encode('cp1252', $my_utf_data)
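
    To make that concrete, here is a small sketch with an invented sample string (not your data). It shows that encode turns characters into octets, and that the octets depend on the target encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text, assumed for illustration: "caf", e-acute, space, euro sign, "5".
my $my_utf_data = "caf\x{E9} \x{20AC}5";

my $cp1252_bytes = encode('cp1252', $my_utf_data);   # one octet per character
my $utf8_bytes   = encode('UTF-8',  $my_utf_data);   # one to three octets per character

printf "characters:    %d\n", length $my_utf_data;
printf "cp1252 octets: %d (%s)\n", length $cp1252_bytes, unpack('H*', $cp1252_bytes);
printf "UTF-8 octets:  %d (%s)\n", length $utf8_bytes,   unpack('H*', $utf8_bytes);
```

    The cp1252 result is what you would hand to a store that expects cp1252 bytes.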
      So the way is starting to seem a little clearer.
      To take the following path

      UTF_String -> CP1252 data -> UTF_String
      Source     -> Storage     -> Display

      I will need to
      encode('cp1252', $my_utf_data) -> store -> decode('utf8', $retrieved_data)
      though that does not give me any usable results.

      I must admit I am very new to character encoding. It would be nice to be able to just manipulate the strings as byte arrays for storage and retrieval, though I'm not sure how to accomplish that in Perl :(

      Thanks,
      Joe
        As indicated in my note below (which was a reply to the anonymous OP), you have to stop talking about cp1252 (Latin1), and switch to cp1251 (Cyrillic) instead -- that alone might account for some of the problems you are having.

        Apart from that, if the data originally comes from a file (or other external source) as utf8 text, your Perl script first has to be made aware that it is utf8 data, either via open($fh,"<:utf8",$file) or via $utf8_string=decode('utf8',$input_string).
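
        Both input routes can be sketched roughly as follows. The temporary file only stands in for your real input, and I have used the stricter :encoding(UTF-8) layer here in place of :utf8; treat the variable names as placeholders.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Stand-in for the real input: a file holding the UTF-8 octets of three
# Cyrillic letters (U+041F, U+0440, U+0438) and a newline.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xD0\x9F\xD1\x80\xD0\xB8\n";
close $tmp_fh or die "close failed: $!";

# Route 1: let the I/O layer decode while reading.
open my $in, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
my $via_layer = <$in>;
close $in;
chomp $via_layer;

# Route 2: read raw octets, then decode them explicitly.
open my $raw, '<:raw', $file or die "Cannot open $file: $!";
my $octets = <$raw>;
close $raw;
chomp $octets;
my $via_decode = decode('UTF-8', $octets);

printf "octets read: %d, characters after decoding: %d\n",
    length $octets, length $via_decode;
```

        Either way you end up with the same three-character text string, which is what encode expects as input.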

        Then you encode( "cp1251", $utf8_string ) and use the resulting string as the input to your non-unicode database. On getting stuff back from the database, do $utf8_string=decode("cp1251", $db_string) to get back to your original utf8 Cyrillic string.
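
        A minimal round trip, with the database left out and an invented sample word, might look like this:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text, assumed for illustration: six Cyrillic letters as code points.
my $utf8_string = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# What would be written to the non-unicode database: one octet per letter.
my $db_string = encode('cp1251', $utf8_string);

# What you do with the octets that come back from the database.
my $restored = decode('cp1251', $db_string);

printf "stored octets: %s\n", unpack('H*', $db_string);
printf "round trip %s\n", $restored eq $utf8_string ? 'ok' : 'FAILED';
```

        The key point is that the same encoding name is used in both directions.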

        But if the original utf8 Cyrillic string included any character(s) that do not exist in the cp1251 character set, those things will not survive the conversion into cp1251, period.

        In that case, you'll need to replace the "unmappable" characters in question with suitable substitutes, if possible, and that will probably involve some manual inspection and decisions about what sort of replacement(s) would be suitable...
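
        To find out which characters need that manual attention, something along these lines may help. The helper name unmappable_in_cp1251 and the sample text are made up; the sketch relies on Encode's FB_CROAK check to make encode die on the first character it cannot map.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: lists the distinct characters in $text that cp1251
# cannot represent, formatted as U+XXXX.
sub unmappable_in_cp1251 {
    my ($text) = @_;
    my %seen;
    for my $char (split //, $text) {
        my $copy = $char;   # encode() may modify its argument when a CHECK is given
        next if eval { encode('cp1251', $copy, Encode::FB_CROAK); 1 };
        $seen{ sprintf 'U+%04X', ord $char } = 1;
    }
    return sort keys %seen;
}

# Sample text: three Cyrillic letters, a smiley (U+263A) and a Greek Omega (U+03A9).
my $text = "\x{41F}\x{440}\x{438} \x{263A} \x{3A9}";
my @bad  = unmappable_in_cp1251($text);

# Without a CHECK argument, encode() silently substitutes "?" for such characters.
my $lossy = encode('cp1251', $text);

print "unmappable: @bad\n";
printf "lossy octets: %s\n", unpack('H*', $lossy);
```

        The list gives you something to review by hand before deciding on replacements, instead of discovering question marks in the database later.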

        (updated in hopes of making things clearer)