
If you have Cyrillic text in utf8 encoding, you will not be able to encode it into cp1252, which is a Latin1 (Western European) code page. You should try encoding it into cp1251, which is the Microsoft single-byte encoding for Cyrillic.

None of the "wide" Cyrillic code points in Unicode (U+0400 - U+052F) can be mapped/converted to cp1252 -- you'll just get a bunch of errors or "?" characters. For that matter, your unicode/utf8 input data might contain other stuff that is neither ASCII nor Cyrillic (or at least does not exist in cp1251), in which case you would still be making a mess when you try to convert to cp1251.
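
To make that concrete, here is a minimal sketch in Python (any language with a codec library behaves the same way). The sample string and the `try_encode` helper are my own illustrative assumptions, not anything from the original question.

```python
# Sample text is an illustrative assumption, not data from the thread.
text = "Привет, мир"
utf8_bytes = text.encode("utf-8")       # what a utf8 file or field would hold


def try_encode(s, codec):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None


decoded = utf8_bytes.decode("utf-8")

as_cp1252 = try_encode(decoded, "cp1252")   # None: cp1252 has no Cyrillic letters
as_cp1251 = try_encode(decoded, "cp1251")   # works: one byte per character

# Forcing the conversion anyway replaces every Cyrillic letter with "?".
lossy = decoded.encode("cp1252", errors="replace")

print(as_cp1252, len(utf8_bytes), len(as_cp1251), lossy)
```

The strict encode fails outright, and the "replace" mode only hides the failure behind question marks, so neither gives you usable cp1252 data.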

The basic problem is: utf8 can store a lot of different characters besides ASCII and Cyrillic (or ASCII and Latin, or ASCII and Greek, or ...), and now people are getting used to the idea of creating text data that has more than just the 200 (give-or-take) displayable characters that are available in any chosen 8-bit code page (like cp125whatever, or iso-8859-whatever).

It might be worth your while to probe your data to see what it really contains -- try one or both of these tools to see what you have: tlu -- TransLiterate Unicode, unichist -- count/summarize characters in data. Maybe it will suffice to "hand select" some appropriate "replacement" characters for some of the code points not available in cp1251.
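
If you don't have those tools handy, the same kind of probe can be sketched in a few lines of Python. This is only a rough stand-in, not how tlu or unichist actually work; the sample string, the function names, and the `REPLACEMENTS` table are all hypothetical choices of mine.

```python
from collections import Counter
import unicodedata


def unencodable_histogram(text, codec="cp1251"):
    """Count each character in text that the given codec cannot represent."""
    counts = Counter()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            counts[ch] += 1
    return counts


# Hypothetical hand-picked substitutes for code points missing from cp1251.
REPLACEMENTS = {"→": "->", "☺": ":)"}


def to_cp1251(text, replacements=REPLACEMENTS):
    """Apply the substitutes, then encode; anything still unmapped becomes '?'."""
    cleaned = "".join(replacements.get(ch, ch) for ch in text)
    return cleaned.encode("cp1251", errors="replace")


sample = "Привет → мир ☺ ☺ αβ"   # illustrative data only

for ch, n in unencodable_histogram(sample).most_common():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}: {n}")

print(to_cp1251(sample).decode("cp1251"))
```

Running the histogram first tells you which characters need a hand-picked replacement; whatever you leave out of the table still ends up as "?".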

If you are using a database like MySQL or whatever, you could probably just store the utf8 character string as a raw byte stream, and not do anything with the encoding -- treat the data as raw binary stuff for insertion and selection, and only worry about encoding at the display end.

If you need to be able to query the database using (sub)string matches on a Cyrillic field, you should still be able to do that, so long as you treat the search string the same way you treated the data when you inserted it -- as a string of raw binary bytes. (How the user provides the string and sees the results is a separate thing, unrelated to how the database handles it.)
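
Here is a small sketch of that raw-bytes approach. I'm assuming an in-memory SQLite database (through Python's `sqlite3` module) as a stand-in for MySQL or whatever you actually use; the table name, column name, and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# BLOB column: the database stores the bytes and never interprets them.
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")

rows = ["Привет, мир", "Доброе утро", "hello world"]   # illustrative data
for text in rows:
    conn.execute("INSERT INTO notes (body) VALUES (?)", (text.encode("utf-8"),))


def find(substring):
    """Byte-level substring search; the needle is encoded the same way as the data."""
    needle = substring.encode("utf-8")
    cur = conn.execute(
        "SELECT body FROM notes WHERE instr(body, ?) > 0 ORDER BY id",
        (needle,),
    )
    # Decoding happens only here, at the display end.
    return [body.decode("utf-8") for (body,) in cur]


matches = find("мир")
print(matches)
```

The match is byte-exact, so this covers plain substring searches but not case-insensitive ones. It is safe for utf8 in particular because the bytes of one encoded character never appear inside the encoding of a different character.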

I think the only time you would need to worry about getting the database to do the correct interpretation of the character encoding is if you need to sort / collate strings in a language-relevant manner (that is, if you have to worry about the Cyrillic equivalent of "alphabetic" vs. "ascii-betic" sorting). In that case, I'm hoping you were mistaken when you said you are using a cp1252-based system, because you might have trouble doing Cyrillic-based stuff on a Latin1 system.
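
To show what "alphabetic" vs. "ascii-betic" means for Cyrillic, here is a rough Python sketch. The word list and the `alphabetic_key` helper are my own assumptions, and the key is a simplified Russian-only ordering, not a real collation (a proper one would come from the database or a locale/ICU library).

```python
# Simplified ordering: the 33 letters of the modern Russian alphabet.
RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(RUSSIAN_ALPHABET)}


def alphabetic_key(word):
    """Sort key by alphabet position; other characters sort after the alphabet."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["яблоко", "ёлка", "ель"]   # illustrative data

# Code point order: what you get from raw binary comparison of utf8 bytes.
by_code_point = sorted(words)
# Alphabet order: "ё" belongs right after "е", not after "я".
by_alphabet = sorted(words, key=alphabetic_key)

print(by_code_point)
print(by_alphabet)
```

The letter "ё" sits at U+0451, after "я" (U+044F), so a binary sort puts "ёлка" last, while a Russian reader would expect it right after "ель". That difference is the thing a raw-bytes column cannot fix for you.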

In reply to Re: Character encoding fun... by graff
in thread Character encoding fun... by Anonymous Monk
