vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I generated some Russian text with a "wrong" character set. What should I do to transform it to a standard set or UTF-8? I may not have formulated my question quite correctly. Perhaps I should say that I need the text to be recognized.

Replies are listed 'Best First'.
Re: Decoding Russian text
by ikegami (Patriarch) on Jul 13, 2011 at 19:10 UTC
    perl -pe' BEGIN { binmode(STDIN, ":encoding(XXX)"); binmode(STDOUT, ":encoding(UTF-8)"); } ' < file.XXX > file.utf-8

    Replace "XXX" with the encoding you used.

    See also: iconv utility.

      Replace "XXX" with the encoding you used.
      But I do not know this. Is it possible to detect it automatically somehow?

        Most versions of Microsoft Internet Explorer contain heuristics to guess the encoding of web pages where the encoding is unknown. They rely on statistical methods based on letter frequency in different languages.

        You could try wrapping your text in basic HTML tags, loading it into MSIE, and seeing which encoding is detected and whether all the texts are detected with the same encoding. (I assume you have at least a rudimentary knowledge of Russian, so you can tell if the heuristics have got it wrong and produced rubbish.)

        If that does not work, or if your documents all have different encodings, then you will need to come up with some heuristics of your own. My suggestion would be to try out all the likely possibilities (using ikegami's code) and compare the output with a wordlist of common Russian words, taken from your system's spellchecker dictionary.
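
        As a rough illustration of the wordlist idea, here is a minimal sketch, not a tested tool. The candidate list, the tiny hard-coded wordlist, and the sub names score_text and guess_encoding are all assumptions made up for this example; a real wordlist would come from a spellchecker dictionary. It decodes the raw bytes with each candidate encoding and keeps the one whose output contains the most known words.

        ```perl
        use strict;
        use warnings;
        use utf8;                      # the word list below is written in UTF-8
        use Encode qw(decode);

        # Assumed shortlist of encodings often used for Russian text.
        my @CANDIDATES = qw(cp1251 koi8-r cp866 iso-8859-5 UTF-8);

        # Tiny illustrative word list (hypothetical; use a real dictionary in practice).
        my %COMMON = map { $_ => 1 } qw(и в не на я что он с как это по но они к у же вы за бы то);

        # Count how many words of the decoded text appear in the word list.
        sub score_text {
            my ($text) = @_;
            my $hits = 0;
            for my $word (split /[^\p{Cyrillic}]+/, lc $text) {
                $hits++ if $COMMON{$word};
            }
            return $hits;
        }

        # Decode $bytes with each candidate; return the best-scoring encoding name,
        # or undef if no candidate produced any known word.
        sub guess_encoding {
            my ($bytes) = @_;
            my ($best, $best_score) = (undef, 0);
            for my $enc (@CANDIDATES) {
                my $text = eval { decode($enc, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
                next unless defined $text;          # bytes are invalid in this encoding
                my $score = score_text($text);
                ($best, $best_score) = ($enc, $score) if $score > $best_score;
            }
            return $best;
        }

        # When run as a script with a file name argument, report the best guess.
        if (!caller() && @ARGV) {
            open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
            my $bytes = do { local $/; <$fh> };
            my $enc = guess_encoding($bytes);
            print defined($enc) ? "Best guess: $enc\n" : "No candidate matched a known word\n";
        }
        ```

        Saved under a name of your choosing and run with the file as its only argument, it prints the winning encoding name. Ties go to whichever candidate is listed first, and very short texts may well fool it, so check the result by eye.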

        You could try different encodings until you find one that works.

        This outputs the file decoded using a variety of encodings. It'll be easier to read if the file contains only one line.

        perl -MEncode -E' binmode(STDIN); binmode(STDOUT, ":encoding(UTF-8)"); $_ = do { local $/; <STDIN> }; for my $enc (Encode->encodings(":all")) { my $dec = eval { decode($enc, $_, Encode::FB_CROAK | Encode::LEAVE_SRC) }; if (defined($dec)) { say "$enc: $dec"; } else { print "$enc: Fail: $@"; } } ' < file

        Replace UTF-8 with the encoding your terminal expects.