in reply to Character encoding fun...
None of the "wide" Cyrillic code points in Unicode (U+0400 - U+052F) can be mapped/converted to cp1252 -- you'll just get a bunch of errors or "?" characters. For that matter, your unicode/utf8 input data might contain other stuff that is neither ASCII nor Cyrillic (or at least does not exist in cp1251), in which case you would still be making a mess when you try to convert to cp1251.
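To make that concrete, here is a minimal sketch in Python (not the language of the original question; the sample string and the helper name try_encode are made up for illustration). It shows the two failure modes mentioned above: a strict conversion raises an error, and a lossy one leaves "?" in place of every Cyrillic letter.

```python
# Sample text is an assumption for illustration only.
text = "Привет, world"

def try_encode(s, codec):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None

# cp1252 has no Cyrillic letters, so the strict conversion fails outright...
strict_cp1252 = try_encode(text, "cp1252")

# ...and the lossy conversion substitutes "?" for each unmappable character.
lossy_cp1252 = text.encode("cp1252", errors="replace")

# cp1251 does cover basic Russian, so the same text round-trips cleanly.
cp1251_bytes = try_encode(text, "cp1251")

print(strict_cp1252)
print(lossy_cp1252)
print(cp1251_bytes.decode("cp1251"))
```

The same pattern applies to whatever conversion layer you actually use: decide up front whether an unmappable character should be an error or a silent substitution.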
The basic problem is: utf8 can store a lot of different characters besides ascii and Cyrillic (or ascii and Latin, or ascii and Greek, or ...), and now people are getting used to the idea of creating text data that has more than just the 200 (give-or-take) displayable characters that are available in any chosen 8-bit code page (like cp125whatever, or iso-8859-whatever).
It might be worth your while to probe your data to see what it really contains -- try one or both of these tools to see what you have: tlu -- TransLiterate Unicode, unichist -- count/summarize characters in data. Maybe it will suffice to "hand select" some appropriate "replacement" characters for some of the code points not available in cp1251.
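If you don't have those tools handy, the same kind of probe can be sketched in a few lines of Python. This is not how tlu or unichist work internally; it is just one assumed way to count the characters that cp1251 cannot hold and then apply a hand-picked replacement table. The sample text and the choices in REPLACEMENTS are hypothetical.

```python
from collections import Counter
import unicodedata

def unmappable_histogram(text, codec="cp1251"):
    """Count each character in text that the codec cannot encode."""
    counts = Counter()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            counts[ch] += 1
    return counts

# Hypothetical sample: Russian plus a Latin accent, an extended Cyrillic
# letter (U+04D8) and a symbol, none of which exist in cp1251.
sample = "Москва, caf\u00e9, \u04d8лем, \u2603\u2603"

hist = unmappable_histogram(sample)
for ch, n in hist.most_common():
    name = unicodedata.name(ch, "UNKNOWN")
    print(f"U+{ord(ch):04X} {name}: {n}")

# Hand-selected replacements (illustrative choices, not a standard mapping).
REPLACEMENTS = str.maketrans({"\u00e9": "e", "\u04d8": "\u042d"})

cleaned = sample.translate(REPLACEMENTS)
# Anything still unmappable after the hand-picked table becomes "?".
encoded = cleaned.encode("cp1251", errors="replace")
print(encoded.decode("cp1251"))
```

Running the histogram first tells you how long the replacement table needs to be, and whether converting to cp1251 is realistic at all for your data.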
If you are using a database like mysql or whatever, you could probably just store the utf8 character string as a raw byte stream, and just not do anything with the encoding -- treat the data as raw binary stuff for insertion and selection, and only worry about encoding at the display end.
If you need to be able to query the database using (sub)string matches on a Cyrillic field, you should still be able to do that, so long as you treat the search string the same way you treated the data when you inserted it -- as a string of raw binary bytes. (How the user provides the string and sees the results is a separate thing, unrelated to how the database handles it.)
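Here is a rough sketch of the raw-bytes approach, using Python's built-in sqlite3 module as a stand-in for mysql (that substitution, the table name notes and the helper find are all assumptions for illustration). The utf8 bytes go in as a BLOB, the search string is turned into bytes the same way, and decoding happens only at the display end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")

# Store the utf8 byte stream as-is; the database never interprets it.
for text in ["Москва столица", "Санкт-Петербург", "plain ascii"]:
    conn.execute("INSERT INTO notes (body) VALUES (?)", (text.encode("utf-8"),))

def find(conn, needle):
    """Byte-wise substring search; the needle is encoded like the stored data."""
    raw = needle.encode("utf-8")
    cur = conn.execute(
        "SELECT body FROM notes WHERE instr(body, ?) > 0 ORDER BY id", (raw,)
    )
    # Decoding is a display concern, done only after the rows come back.
    return [body.decode("utf-8") for (body,) in cur]

matches = find(conn, "Петер")
print(matches)
```

Note that a byte-wise match is exact: it knows nothing about case, so searching for "петер" in lowercase finds nothing here. That is part of the price of leaving the encoding out of the database.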
I think the only time you would need to worry about getting the database to do the correct interpretation of the character encoding is if you need to sort / collate strings in a language-relevant manner (that is, if you have to worry about the Cyrillic equivalent of "alphabetic" vs. "ascii-betic" sorting). In that case, I'm hoping you were mistaken when you said you are using a cp1252-based system, because you might have trouble doing Cyrillic-based stuff on a Latin1 system.
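To illustrate the "alphabetic" vs. "ascii-betic" difference, here is a small Python sketch. Plain code point order puts "ё" (U+0451) after "я" (U+044F), while Russian alphabetical order puts it right after "е". The alphabet_key function is a deliberately simplified assumption covering lowercase Russian only; it is not a substitute for real collation in a database or a proper collation library.

```python
# Russian alphabet in dictionary order (lowercase only, for illustration).
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def alphabet_key(word):
    """Sort key by position in ALPHABET; other characters sort after it."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word.lower()]

words = ["яблоко", "ёлка", "ель", "жук"]

by_code_point = sorted(words)                   # "ascii-betic" order
by_alphabet = sorted(words, key=alphabet_key)   # "alphabetic" order

print(by_code_point)
print(by_alphabet)
```

If the data sits in the database as raw bytes, the database can only give you the first kind of ordering (or a byte-wise equivalent), which is why collation is the one place where the declared encoding matters.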