in reply to Re^2: Decoding Russian text
in thread Decoding Russian text

Most versions of Microsoft Internet Explorer contain heuristics to guess the encoding of web pages where the encoding is unknown. They work by statistical methods based on the letter frequencies of different languages.

You could try wrapping your texts in basic HTML tags, loading them into MSIE, and seeing which encoding is detected and whether all the texts are detected with the same encoding. (I assume you have at least a rudimentary knowledge of Russian, so you can tell if the heuristics have got it wrong and produced rubbish.)

If that does not work, or if your documents all have different encodings, then you will need to come up with some heuristics of your own. My suggestion would be to try out all the likely possibilities (using ikegami's code), and compare the output with a word list of common Russian words, taken from your system's spellchecker dictionary.
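
A minimal sketch of that word-list idea, using the core Encode module rather than ikegami's code (which is not reproduced here). The candidate list, the ten-word stand-in dictionary, and the name guess_cyrillic_encoding are all assumptions made for illustration; a real word list would come from a spellchecker dictionary and be far larger.

```perl
use strict;
use warnings;
use utf8;                       # the word list below is written in UTF-8
use Encode qw(decode encode);

# Assumed candidates: the common single-byte Cyrillic encodings.
my @candidates = qw(koi8-r cp1251 iso-8859-5);

# Tiny stand-in for a real dictionary of common Russian words.
my %common = map { $_ => 1 } qw(и в не на что он как это по но);

# Decode the bytes with each candidate and count how many of the
# resulting Cyrillic "words" appear in the word list.
sub guess_cyrillic_encoding {
    my ($bytes) = @_;
    my ($best, $best_score) = (undef, 0);
    for my $enc (@candidates) {
        my $text  = decode($enc, $bytes);
        my $score = grep { $common{ lc $_ } } $text =~ /(\p{Cyrillic}+)/g;
        ($best, $best_score) = ($enc, $score) if $score > $best_score;
    }
    return $best;               # undef if no candidate matched any word
}

# Usage: encode a known sentence, then see which candidate scores best.
my $sample = encode('cp1251', 'это не так, как он сказал');
print guess_cyrillic_encoding($sample) // 'unknown', "\n";
```

A wrong decoding can still hit the list by accident (KOI8-R bytes for "он" read as Windows-1251 give "ПО", which lowercases to "по"), so the score is only meaningful on texts long enough for the right encoding to win clearly.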

Re^4: Decoding Russian text
by Jim (Curate) on Jul 14, 2011 at 01:03 UTC

    For interactively exploring the character encodings of text, I like BabelPad. It's a Unicode text editor, but it recognizes and automatically detects many legacy encodings.

    No one has mentioned the Perl modules Encode::Guess (core) or Encode::Detect (CPAN) yet.

    The Cyrillic text is most likely in one of the encodings KOI8-R, Windows-1251, or ISO 8859-5. (Probably KOI8-R, but that's just a guess.)

    Jim

      Encode::Guess says it can't distinguish between single-byte encodings.
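
      A small sketch of what that looks like in practice, assuming only the core Encode and Encode::Guess modules. The sample text and the two suspects are illustrative choices. Because the same bytes decode without error under both suspects, guess_encoding hands back a plain message string rather than an encoding object.

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);
      use Encode::Guess;

      # Illustrative input: a short Russian phrase encoded as Windows-1251.
      my $octets = encode('cp1251',
          "\x{44d}\x{442}\x{43e} \x{43d}\x{435} \x{43a}\x{430}\x{43a} \x{43e}\x{43d}");

      # Ask Encode::Guess to choose between two single-byte suspects.
      my $result = guess_encoding($octets, qw(koi8-r cp1251));

      if (ref $result) {
          # A unique match comes back as an encoding object.
          print "Guessed: ", $result->name, "\n";
      }
      else {
          # Otherwise the return value is a string describing the problem.
          print "No unique guess: $result\n";
      }
      ```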

      I'll definitely start mentioning Encode::Detect, which I hadn't heard of before.

        I thought to mention the Perl encoding detection modules specifically because your script probably doesn't distinguish between single-byte encodings such as KOI8-R and Windows-1251. (Or does it?) vit seems to be asking for automated encoding detection in Perl that can at least distinguish between the common Cyrillic legacy encodings, which are single-byte encodings. In the general case, this is a sticky wicket.

        Needless to say, character encoding guessing is guesswork. And disambiguating characters in single-byte encodings is much more guess-y than it is in multi-byte encodings.

        (For fun, read Bush hid the facts.)

        Jim