Re: create clone script for utf8 encoding

I thought you might be interested in this: I just released "enctool", which guesses and verifies files' encodings. For example, to test whether a file that you know contains Cyrillic characters is encoded in UTF-8 or KOI8-R: enctool --encodings=UTF-8,KOI8-R --one-of='\p{Script=Cyrillic}' filename.txt

There are lots of other options too, see the POD; in this case, e.g. --test-all --list-chars --extra-verbose might also be interesting. Although there are tests, I rewrote it pretty much from scratch from an earlier version, so I've still labeled it beta. If there are issues, let me know.
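For readers who want to see the general idea behind "try each candidate encoding, then check for expected characters", here is a minimal Python sketch. It is not how enctool is implemented; the function names `contains_cyrillic` and `plausible_encodings` are made up for illustration, and the Cyrillic test is a crude range check on the basic Cyrillic block (U+0400 to U+04FF) rather than a real `\p{Script=Cyrillic}` match.

```python
def contains_cyrillic(text):
    # Crude stand-in for \p{Script=Cyrillic}: only the basic Cyrillic block.
    return any("\u0400" <= ch <= "\u04ff" for ch in text)


def plausible_encodings(data, candidates):
    """Return the candidate encodings under which `data` decodes without
    error AND the decoded text contains at least one Cyrillic character."""
    hits = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if contains_cyrillic(text):
            hits.append(name)
    return hits


# Hypothetical sample: the word "Привет" stored as KOI8-R bytes.
sample = "Привет".encode("koi8_r")
print(plausible_encodings(sample, ["utf-8", "koi8_r"]))
```

Note that single-byte encodings such as KOI8-R accept almost any byte sequence, so a successful decode alone proves little. A check like this can still report more than one plausible candidate, which is presumably why a dedicated tool offers more thorough tests.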

Update: If you work with KOI8-R a lot, you might want to change the default list of encodings. One way is to put this in your ~/.profile: export ENCTOOL_ENCODINGS="ASCII,UTF-8,KOI8-R,Latin1,CP1252"


Re^2: create clone script for utf8 encoding
by Anonymous Monk on Dec 25, 2018 at 16:47 UTC

I knew a person who could intuitively decipher Mojibake that resulted from mishandling single-byte encodings, like this:

(rows: encoded to; columns: decoded as)

            KOI8-R    CP1251    CP866
KOI8-R      Привет    рТЙЧЕФ    Ё╥╔╫┼╘
CP1251      оПХБЕР    Привет    ╧ЁштхЄ
CP866       ▐Ю╗╒╔Б    ЏаЁўҐв    Привет
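A table like the one above can be reproduced by encoding the word in one single-byte encoding and decoding the resulting bytes in another. The following is a small Python sketch of that; the helper name `mojibake` is made up for illustration, and the codec names are the ones Python's standard library uses for these encodings.

```python
ENCODINGS = ["koi8_r", "cp1251", "cp866"]


def mojibake(text, encoded_as, decoded_as):
    # Encode with one codec, then (mis)decode the same bytes with another.
    # errors="replace" guards against bytes that are undefined in the target.
    return text.encode(encoded_as).decode(decoded_as, errors="replace")


word = "Привет"
# Header row: the encoding the bytes were decoded as.
print("encoded \\ decoded".ljust(18) + "".join(e.ljust(10) for e in ENCODINGS))
for enc in ENCODINGS:
    # One row per encoding the text was actually written in.
    cells = [mojibake(word, enc, dec) for dec in ENCODINGS]
    print(enc.ljust(18) + "".join(c.ljust(10) for c in cells))
```

The diagonal of the output is the original word, since decoding with the same encoding that was used for encoding round-trips cleanly.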

Thankfully, we rarely have to resort to this sort of frequency analysis nowadays.

Re^3: create clone script for utf8 encoding

by Aldebaran (Curate) on Jan 03, 2019 at 11:05 UTC

Interesting...I didn't know that Mojibake had a name but do have a sense of it. I think of it like being in the weeds.
