Re^2: Character encoding fun...

Re^3: Character encoding fun... by pc88mxer (Vicar) on Nov 15, 2007 at 20:48 UTC
Your problem is that `$my_utf_data` contains code points (numbers representing Unicode characters), not octets (i.e. bytes). If `$my_utf_data` really contained bytes, no character in that string would be greater than 255. The error message you are getting indicates that there are characters > 255 in your string. If `$my_utf_data` is really text (i.e. consists of code points), then all you need is the call to `encode` to get a cp1252-encoded stream of bytes: `encode('cp1252', $my_utf_data)`
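To make the code point versus octet distinction concrete, here is a minimal sketch. The sample string is made up for illustration, and `Encode::FB_CROAK | Encode::LEAVE_SRC` is one possible choice of check mode (die on unmappable characters, leave the source string untouched), not something the original post prescribes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text: "café" plus the euro sign (U+20AC), as code points.
my $my_utf_data = "caf\x{e9} \x{20ac}";

# A character above 255 can only occur in a string of code points,
# never in a string of raw bytes.
my $has_wide = $my_utf_data =~ /[^\x00-\xFF]/ ? 1 : 0;

# Turn the text into cp1252 bytes. FB_CROAK makes an unmappable character
# fatal; LEAVE_SRC keeps encode() from modifying $my_utf_data in place.
my $bytes = encode('cp1252', $my_utf_data,
                   Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "has wide chars: %d, characters: %d, bytes: %d\n",
    $has_wide, length($my_utf_data), length($bytes);
```

After encoding, every element of `$bytes` is in the range 0..255, so it can be written to a byte-oriented store without a "wide character" complaint.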
Re^4: Character encoding fun... by joem (Initiate) on Nov 15, 2007 at 21:02 UTC
So the way is starting to seem a little clearer. To take the following path:

UTF_String (source) -> CP1252 data (storage) -> UTF_String (display)

I will need to `encode('cp1252', $my_utf_data) -> store -> decode('utf8', $retrieved_data)`, though that does not provide any results per se. I must admit I am very new to the character encoding scene. It would be nice to be able to just manipulate the strings as byte arrays for storage and retrieval, though I'm not sure how to accomplish that in Perl :(

Thanks, Joe
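One likely reason the path above gives no usable result is that the decode step names a different encoding than the encode step. The following sketch, with a made-up sample string, compares decoding the stored bytes as cp1252 against decoding them as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample text containing two non-ASCII characters.
my $my_utf_data = "na\x{ef}ve caf\x{e9}";

# Bytes as they would be written to storage.
my $stored = encode('cp1252', $my_utf_data);

# Decoding with the same encoding used for storage restores the text.
my $matched = decode('cp1252', $stored);

# Decoding the same bytes as UTF-8 does not: 0xEF and 0xE9 do not form
# valid UTF-8 sequences here, so Encode substitutes U+FFFD by default.
my $mismatched = decode('UTF-8', $stored);

print "cp1252 decode restores the text: ",
    ($matched eq $my_utf_data ? "yes" : "no"), "\n";
print "UTF-8 decode restores the text: ",
    ($mismatched eq $my_utf_data ? "yes" : "no"), "\n";
```

Whatever encoding is passed to `encode` before storing is the one to pass to `decode` after retrieval.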
Re^5: Character encoding fun... by graff (Chancellor) on Nov 16, 2007 at 02:54 UTC
As indicated in my note below (which was a reply to the anonymous OP), you have to stop talking about cp1252 (Western European, roughly Latin-1), and switch to cp1251 (Cyrillic) instead -- that alone might account for some of the problems you are having.

Apart from that, if the data originally comes from a file (or other external source) as utf8 text, your Perl script first has to be made aware that it is utf8 data, either via `open($fh,"<:utf8",$file)`, or via `$utf8_string=decode('utf8',$input_string)`. Then you `encode( "cp1251", $utf8_string )` and use the resulting string as the input to your non-Unicode database. On getting stuff back from the database, do `$utf8_string=decode("cp1251", $db_string)` to get back to your original utf8 Cyrillic string.

But if the original utf8 Cyrillic string included any character(s) that do not exist in the cp1251 character set, those things will not survive the conversion into cp1251, period. In that case, you'll need to replace the "unmappable" characters in question with suitable substitutes, if possible, and that will probably involve some manual inspection and decisions about what sort of replacement(s) would be suitable...

(updated in hopes of making things clearer)
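The steps described here can be sketched end to end as follows. The input bytes are a hard-coded stand-in for data read from a UTF-8 file, `$db_string` stands in for whatever the database stores and returns, and U+263A is just an arbitrary example of a character that cp1251 cannot represent. All of these are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for raw input: the UTF-8 octets of the Cyrillic word "Привет".
my $input_string = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Step 1: tell Perl the input is UTF-8 (bytes -> characters).
my $utf8_string = decode('UTF-8', $input_string);

# Step 2: encode to cp1251 for the non-Unicode database (characters -> bytes).
my $db_string = encode('cp1251', $utf8_string);

# Step 3: on retrieval, decode cp1251 back to characters.
my $restored = decode('cp1251', $db_string);

# An unmappable character: strict mode dies, default mode substitutes "?".
my $mixed  = $utf8_string . " \x{263A}";
my $strict = eval {
    encode('cp1251', $mixed, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $lossy  = encode('cp1251', $mixed);

print "round trip ok\n" if $restored eq $utf8_string;
print "strict encode refused the unmappable character\n" unless defined $strict;
print "lossy encode ends with '?'\n" if $lossy =~ /\?\z/;
```

Running the strict encode first is one way to find the strings that need manual replacement before they are silently stored as `?`.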