Re: Decoding Russian text

Re^2: Decoding Russian text by vit (Friar) on Jul 13, 2011 at 19:56 UTC
Replace "XXX" with the encoding you used. But I do not know this. Is it possible to detect it automatically somehow?
Re^3: Decoding Russian text by chrestomanci (Priest) on Jul 13, 2011 at 20:58 UTC
Most versions of Microsoft Internet Explorer contain heuristics to guess the encoding of web pages whose encoding is unknown. They use statistical methods based on letter frequencies in different languages. You could try wrapping your text in basic HTML tags, loading it into MSIE, and seeing which encoding is detected, and whether all the texts are detected with the same encoding. (I assume you have at least a rudimentary knowledge of Russian, so you can tell if the heuristics have got it wrong and produced rubbish.) If that does not work, or if your documents all have different encodings, then you will need to come up with some heuristics of your own. My suggestion would be to try out all the likely possibilities (using ikegami's code), and compare the output with a word list of common Russian words, taken from your system's spellchecker dictionary.
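The word-list idea above could be sketched roughly as follows. This is only an illustration: the candidate list, the tiny word list, and the name `best_encoding` are all assumptions made for the example, and a real word list would come from a spellchecker dictionary as suggested.

```perl
use strict;
use warnings;
use utf8;                      # the word list below is written in UTF-8
use Encode qw(decode encode);

# Hypothetical candidate encodings and a tiny stand-in word list (assumptions).
my @CANDIDATES   = qw(UTF-8 koi8-r cp1251 iso-8859-5 cp866);
my %COMMON_WORDS = map { $_ => 1 } qw(и в не на что я он с как это по но к у же);

# Return the candidate whose decoding contains the most common Russian words,
# or undef if no candidate produces a single hit.
sub best_encoding {
    my ($bytes) = @_;
    my ($best, $best_score) = (undef, 0);
    for my $enc (@CANDIDATES) {
        # FB_CROAK makes decode die on invalid input; LEAVE_SRC keeps $bytes intact.
        my $text = eval { decode($enc, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
        next unless defined $text;
        # Split on anything that is not a Cyrillic letter and count word-list hits.
        my $score = grep { $COMMON_WORDS{ lc $_ } } split /[^\p{Cyrillic}]+/, $text;
        ($best, $best_score) = ($enc, $score) if $score > $best_score;
    }
    return $best;
}

# Usage: encode a sample ourselves, then see which candidate scores best.
my $bytes = encode('koi8-r', 'я не знаю, что это');
my $enc   = best_encoding($bytes);
print defined $enc ? "best guess: $enc\n" : "no candidate matched\n";
```

With very short inputs a wrong encoding can score a hit by accident, so a longer sample and a larger word list should make the result more trustworthy.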
Re^4: Decoding Russian text by Jim (Curate) on Jul 14, 2011 at 01:03 UTC
For interactively exploring the character encodings of text, I like BabelPad. It's a Unicode text editor, but it recognizes and automatically detects many legacy encodings. No one has mentioned the Perl modules Encode::Guess (core) or Encode::Detect (CPAN) yet. The Cyrillic text is most likely in one of the encodings KOI8-R, Windows-1251, or ISO 8859-5. (Probably KOI8-R, but that's just a guess.) Jim
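A minimal sketch of how Encode::Guess might be used here. The helper name `decode_by_guess` is made up for the example. One caveat: Encode::Guess works by checking which suspects can decode the bytes at all, so single-byte encodings such as KOI8-R and Windows-1251 tend to come back as ambiguous, and it is most useful for telling UTF-8 apart from the rest.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: return the decoded text if the guess is unambiguous,
# otherwise undef.
sub decode_by_guess {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    # On success guess_encoding returns an encoding object;
    # on failure it returns a plain error string instead.
    return ref($guess) ? $guess->decode($bytes) : undef;
}

# Usage: UTF-8 is among the default suspects, so no extra suspects are needed.
my $bytes = encode('UTF-8', 'Привет, мир');
my $text  = decode_by_guess($bytes);
print defined $text ? "decoded unambiguously\n" : "guess was ambiguous\n";
```

When the guess is ambiguous, the candidates still have to be ranked some other way, for example by checking the decoded text against known Russian words.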
Re^5: Decoding Russian text by ikegami (Patriarch) on Jul 14, 2011 at 02:07 UTC
Re^6: Decoding Russian text by Jim (Curate) on Jul 14, 2011 at 19:27 UTC
Some notes below your chosen depth have not been shown here
Re^3: Decoding Russian text by ikegami (Patriarch) on Jul 13, 2011 at 21:16 UTC
You could try different encodings until you find one that works. This outputs `file` decoded using a variety of encodings. It'll be easier to read if `file` only contains one line. `perl -MEncode -E' binmode(STDIN); binmode(STDOUT, ":encoding(UTF-8)"); $_ = do { local $/; <STDIN> }; for my $enc (Encode->encodings(":all")) { my $dec = eval { decode($enc, $_, Encode::FB_CROAK | Encode::LEAVE_SRC) }; if (defined($dec)) { say "$enc: $dec"; } else { print "$enc: Fail: $@"; } } ' < file` Replace `UTF-8` with the encoding your terminal expects.
Re^4: Decoding Russian text by vit (Friar) on Jul 13, 2011 at 21:40 UTC
For some reason, if I apply :encoding(UTF-8) to both STDIN and STDOUT, it works fine. But I want to apply the encoding to a single string, not to an input stream. How can I do this?
Re^5: Decoding Russian text by ikegami (Patriarch) on Jul 13, 2011 at 23:00 UTC