in reply to create clone script for utf8 encoding

I thought you might be interested in this: I just released "enctool", which guesses and verifies the encodings of files. For example, to test whether a file that you know contains Cyrillic characters is encoded in UTF-8 or KOI8-R:

    enctool --encodings=UTF-8,KOI8-R --one-of='\p{Script=Cyrillic}' filename.txt

There are lots of other options too; see the POD. In this case, --test-all --list-chars --extra-verbose might also be interesting. Although there are tests, I rewrote it pretty much from scratch from an earlier version, so I've still labeled it beta. If there are issues, let me know.

Update: If you work with KOI8-R a lot, you might want to change the default list of encodings. One way is to put this in your ~/.profile:

    export ENCTOOL_ENCODINGS="ASCII,UTF-8,KOI8-R,Latin1,CP1252"
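
For readers who want to see the general guess-and-verify idea in isolation, here is a minimal Python sketch. It is not how enctool is implemented. The function name `candidate_encodings` is hypothetical, and the check for the basic Cyrillic block (U+0400 to U+04FF) is only a rough stand-in for a `\p{Script=Cyrillic}` test. Each candidate encoding is tried in order, and it is kept only if the bytes decode without error and the result contains at least one Cyrillic character.

```python
def candidate_encodings(data, encodings=("utf-8", "koi8-r"), lo=0x0400, hi=0x04FF):
    """Return the encodings (in the order given) under which `data` decodes
    cleanly and yields at least one character in the range lo..hi.
    The default range is the basic Cyrillic block (an assumption made here
    as a rough approximation of a script test)."""
    matches = []
    for enc in encodings:
        try:
            text = data.decode(enc)  # strict mode: raises on invalid bytes
        except UnicodeDecodeError:
            continue  # this encoding cannot be right
        if any(lo <= ord(ch) <= hi for ch in text):
            matches.append(enc)
    return matches


if __name__ == "__main__":
    sample = "Привет".encode("koi8-r")
    print(candidate_encodings(sample))
```

Note that a check like this can leave more than one candidate standing: UTF-8 bytes also decode without error as KOI8-R, and the garbled result may still contain some Cyrillic letters. The order of the candidate list therefore matters, and stricter checks help narrow things down.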

Re^2: create clone script for utf8 encoding
by Anonymous Monk on Dec 25, 2018 at 16:47 UTC

    I knew a person who could intuitively decipher Mojibake that resulted from mishandling single-byte encodings, like this:
                     decoded as...
    encoded to...    KOI8-R    CP1251    CP866
    KOI8-R           Привет    рТЙЧЕФ    Ё╥╔╫┼╘
    CP1251           оПХБЕР    Привет    ╧ЁштхЄ
    CP866            ▐Ю╗╒╔Б    ЏаЁўҐв    Привет
    Thankfully, we rarely have to resort to that sort of frequency analysis nowadays.
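
The encode/decode combinations discussed in this reply can be reproduced with a short Python sketch. It assumes Python's built-in `koi8-r`, `cp1251` and `cp866` codecs; the helper name `mojibake_table` is made up for illustration. Each entry is the word encoded with one codec and then decoded with another.

```python
WORD = "Привет"
CODECS = ["koi8-r", "cp1251", "cp866"]


def mojibake_table(word=WORD, codecs=CODECS):
    """Map (encoded_with, decoded_as) to the resulting string."""
    return {
        (enc, dec): word.encode(enc).decode(dec)
        for enc in codecs
        for dec in codecs
    }


table = mojibake_table()

if __name__ == "__main__":
    # One row per "encoded to" codec, one column per "decoded as" codec.
    for enc in CODECS:
        print(enc.ljust(8), *(table[enc, dec] for dec in CODECS))
```

Because all three are single-byte encodings and every byte involved here is defined in each of them, the damage is reversible: encoding the garbled text with the codec it was wrongly decoded as, then decoding with the right one, recovers the original word.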

      Interesting... I didn't know that Mojibake had a name, but I do have a sense of it. I think of it like being in the weeds.