Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Just as soon as ikegami made me aware that us-ascii is my default encoding, I seem to be developing problems with it. These problems are of the variety of my computer not behaving the way I expect it to.

I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

So this is a day in the life, where I use this nifty software: translate shell

$ trans :de -brief "over" >2.ascii.de.txt
$ trans :de -brief "He must." >>2.ascii.de.txt
$ cat 2.ascii.de.txt
Über
Er muss.
$ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
iconv: illegal input sequence at position 0
$

On STDOUT for me, I get Ü as the zeroth character. Does ascii have a representation for Ü?

Then I keep trying to get an iconv command to do something for me, but an effective syntax eludes me. Why is Ü illegal in the iconv command?

If I'm going to have source that has utf8 characters in it, doesn't it make sense to change the underlying encoding to utf8, or to create it that way from the get-go?

After I've touched a file into existence, I use a bash script to clone the next version of a script. All of my scripts have a taxonomy of a positive integer followed by a period, followed by a word. The cloned script is incremented, given execute privileges, and has its name written to a manifest. There isn't any language in it for determining the underlying encoding. I've gotten a lot of mileage out of this script, but I think it's time to replace it with shiny new, lexical perl. I'll put it in readmore tags for being somewhat OT:

$ cat 2.create.bash
#!/bin/bash
# which bash version?
echo "The shebang is specifying bash"
if [ -z "${BASH_VERSION}" ]; then
    echo "Not using bash but dash"
else
    echo "Using bash ${BASH_VERSION}"
fi
#get the first number from $1
#c=$(("$1" : '\([0-9]*\).*$')) didn't work
c=$(expr "$1" : '\([0-9]*\).*$')
echo $c
f=$1
#integer addition
d=$(expr $c + 1)
echo $d
#munge new file, no clobber
t="$d"
q=${f#*.}
s=$t.$q
echo $s
cp -n $f $s
chmod +x $s
echo $s >> 1.manifest
ls -lh $s
gedit $s &
$

I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding. I'd show previous attempts, but they look awful.

Finally, what makes any of these en_**.utf8 encodings different from another?

$ locale charmap
UTF-8
$ locale -a
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
ru_RU.utf8
ru_UA.utf8
$ locale -m
ANSI_X3.110-1983
ANSI_X3.4-1968
...
UTF-8
VIDEOTEX-SUPPL
VISCII
WIN-SAMI-2
WINDOWS-31J
$

Thanks for your comment

Re: create clone script for utf8 encoding
by haukex (Archbishop) on Dec 15, 2018 at 13:29 UTC

    You've already gotten some good answers, I just wanted to pick up on a couple more points. In general, you might want to have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), as well as perluniintro and perlunicode.

    when I touch a file into existence, it is us-ascii

    touch creates an empty file, and it doesn't have any encoding - a file's encoding is not some kind of metadata attribute secretly attached to a file. A file just contains bytes, and it is up to the reading and writing programs to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here). Many software packages have defaults for such encodings, such as Latin-1, CP1252, UTF-8, or UTF-16, but the software packages often don't agree. And of those four examples, only the latter two allow you to encode all valid Unicode code points. As for ASCII, it is a subset of many different character encodings (such as Latin-1, CP1252, and UTF-8), and it only covers bytes with values 0 to 127 (the lower 7 bits of the byte), so it encodes even fewer characters:

    For example, the Euro symbol € ("\x{20AC}" or "\N{U+20AC}" in Perl) is:

    not representable at all in ASCII or Latin-1
    the single byte 80 in CP1252
    the three bytes E2 82 AC in UTF-8
    the two bytes 20 AC in UTF-16BE (or AC 20 in UTF-16LE)

    (Copied from my post here.) I wrote some more about the whole topic here.

    files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform.

    As the AM post explained, that's unlikely here, since "Ü" is not representable in ASCII (which is also what iconv is telling you with its error) - most likely it's UTF-8. You can check by piping the output to e.g. hexdump: on my terminal, echo -n "€" | hexdump -C shows the bytes e2 82 ac, and as I showed above, that's the UTF-8 encoding. If you're really unsure of a file's encoding, there's Encode::Guess (I showed an example here), keeping in mind that it's just guessing.
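
    Here's a minimal sketch of that kind of guessing (the file name and the list of extra suspects are just examples):

    use warnings;
    use strict;
    use Encode::Guess;

    # Read the raw bytes of the file whose encoding we want to guess.
    my $file = "2.ascii.de.txt";
    open my $fh, '<:raw', $file or die "$file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    # guess_encoding() tries ASCII and UTF-8 plus any extra suspects we
    # name; it returns an encoding object on success, or an error string
    # when it fails or the result is ambiguous.
    my $guess = guess_encoding($bytes, qw/koi8-r latin1/);
    print ref $guess ? "looks like " . $guess->name . "\n"
                     : "no clear guess: $guess\n";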

    I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding.

    If you're talking about the Perl source code itself, IMHO really the only two useful choices are plain ASCII, or UTF-8, and in the latter case, you have to tell Perl by adding use utf8; at the top of your file (see utf8). If your Perl source code is in ASCII, you can still represent Unicode characters in strings and regexes using escapes like "\x{...}" and "\N{...}" (see also charnames). And since ASCII is a subset of UTF-8, if you stick to those two encodings for your Perl source, your "clone" script doesn't have anything to worry about, it can just cp the files, all you need to do is add the use utf8; when appropriate. Just make sure that whatever editor you're using to work on your Perl scripts uses UTF-8 when it saves the files.
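
    For instance, here's a small self-contained sketch: the source file is pure ASCII, yet it prints non-ASCII characters (the binmode line is just one way to set the output encoding; the open pragma mentioned in the next paragraph works too):

    use warnings;
    use strict;
    use charnames ':full';   # enables "\N{CHARACTER NAME}" escapes

    my $u_uml = "\x{00DC}";                 # Ü, by code point
    my $euro  = "\N{EURO SIGN}";            # €, by name
    binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output
    print "$u_uml $euro\n";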

    If you're talking about files that your Perl program is reading and writing, you'd specify those encodings with the three-argument open (which I'd recommend), with binmode, or set defaults with the open pragma (the latter is useful for changing the encoding of the STDIN/OUT/ERR streams as well). For en-/decoding strings of bytes you've already got in Perl, there's the Encode family of modules, plus for UTF-8, utf8::encode() and utf8::decode(). There's also the -C command-line switch (which I'd mostly only use for oneliners) and the PERLIO environment variable (which I've almost never had a need for), see perlrun.
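
    And a tiny sketch of the Encode interface (the two bytes here are what UTF-8 uses for "Ü"):

    use warnings;
    use strict;
    use Encode qw(encode decode);

    my $bytes = "\xC3\x9C";               # two bytes: the UTF-8 encoding of "Ü"
    my $chars = decode('UTF-8', $bytes);  # now one character, U+00DC
    printf "U+%04X\n", ord $chars;        # prints "U+00DC"
    my $again = encode('UTF-8', $chars);  # back to the bytes C3 9C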

    BTW, you can do the same thing as iconv with Perl:

    use warnings;
    use strict;
    # iconv -f UTF-8 utf8.txt -t Latin9 -o latin9.txt
    my ($ifile, $ienc) = ("utf8.txt",   "UTF-8");
    my ($ofile, $oenc) = ("latin9.txt", "Latin9");
    open my $ifh, "<:raw:encoding($ienc)", $ifile or die "$ifile: $!";
    open my $ofh, ">:raw:encoding($oenc)", $ofile or die "$ofile: $!";
    print $ofh do { local $/; <$ifh> };
    close $ifh;
    close $ofh;

      a file's encoding is not some kind of metadata attribute secretly attached to a file. A file just contains bytes, and it is up to the reading and writing programs to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here).

      Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

      $ file -i *.pl
      18.clone.pl: text/x-perl; charset=us-ascii
      1.a.pl: text/x-perl; charset=utf-8
      1.haukex.pl: text/x-perl; charset=us-ascii
      1.k.pl: text/x-perl; charset=us-ascii
      2.haukex.pl: text/x-perl; charset=us-ascii
      3.haukex.pl: text/x-perl; charset=utf-8
      3.ping3a.pl: text/x-perl; charset=us-ascii
      4.haukex.pl: text/x-perl; charset=utf-8
      4.ping3a.pl: text/x-perl; charset=us-ascii
      5.ping3a.pl: text/x-perl; charset=us-ascii
      $

      What seems to be very much the case is that the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it. I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters, like so with pre tags:

      $ ./1.a.pl 3.haukex.pl
      argv is 3.haukex.pl
      before decode is 3.haukex.pl
      after decode is 3.haukex.pl
      current is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw
      -------------
      in_file: 3.haukex.pl
      new base is 4.haukex.pl
      save path is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
      return is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
      2.haukex.pl
      3.haukex.pl
      4.haukex.pl
      #!/usr/bin/perl -w
      use 5.011;
      use Carp;
      use Data::Alias 'alias';
      use Data::Dumper;
      use utf8;   # a la François
      use open OUT => ':encoding(utf8)';
      use open ':std';
      
      sub rangeparse {
      	local $_ = shift;
      	my @o;  #  row1,col1, row2,col2  (-1 = last row/col)
      	if (@o=/\AR([0-9]+|n)C([0-9]+|n):R([0-9]+|n)C([0-9]+|n)\z/) {}
      	elsif (/\AR([0-9]+|n):R([0-9]+|n)\z/) { @o=($1,1,$2,-1) }
      	elsif (/\AC([0-9]+|n):C([0-9]+|n)\z/) { @o=(1,$1,-1,$2) }
      	elsif (/\AR([0-9]+|n)C([0-9]+|n)\z/) { @o=($1,$2,$1,$2) }
      	elsif (/\AR([0-9]+|n)\z/) { @o=($1,1,$1,-1) }
      	elsif (/\AC([0-9]+|n)\z/) { @o=(1,$1,-1,$1) }
      	else { croak "failed to parse '$_'" }
      	$_ eq 'n' and $_=-1 for @o;
      	return \@o;
      }
      
      
      
      use Test::More tests=>2;
      
      
      
      is_deeply rangeparse("RnC2:RnC5"),   -1, 2, -1, 5 ;
      is_deeply rangeparse("R3C2:RnCn"),    3, 2, -1,-1 ;
      
      my $data = ['й', ' ', ' ', 'л', ' ', ' ', 'с', ' ', ' ', 1..9];
      
      say Dumper $data; 
      
      
      sub getsubset {
      	my ($data,$range) = @_;
      	my $cols = @{$$data[0]};
      	@$_==$cols or croak "data not rectangular" for @$data;
      	$range = rangeparse($range) unless ref $range eq 'ARRAY';
      	@$range==4 or croak "bad size of range";
      	my @max = (0+@$data,$cols)x2;
      	for my $i (0..3) {
      		$$range[$i]=$max[$i] if $$range[$i]<0;
      		croak "index $i out of range"
      			if $$range[$i]<1 || $$range[$i]>$max[$i];
      	}
      	croak "bad rows $$range[0]-$$range[2]" if $$range[0]>$$range[2];
      	croak "bad cols $$range[1]-$$range[3]" if $$range[1]>$$range[3];
      	my @cis = $$range[1]-1 .. $$range[3]-1;
      	return [ map { sub{\@_}->(@{$$data[$_]}[@cis]) }
      		$$range[0]-1 .. $$range[2]-1 ]
      }
      
      

      This is a trimmed down version of haukex's result in Selecting Ranges of 2-Dimensional Data. I'm populating it with cyrillic values and hope to run some tests, but I still want to get this clone tool squared away. Still working through other parts of your post....

        Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

        Yes, with the emphasis being that it's just a guess.

        the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it.

        Yes, and it might be important to note that there are certain sequences of bytes that are not valid UTF-8 (see e.g. UTF-8), which means that in some cases, it's possible to differentiate between random bytes and valid UTF-8 text. Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.
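
        As a rough sketch of how to put that fact to use in Perl (strict decoding croaks on malformed input, so a successful eval means the bytes were valid UTF-8):

        use warnings;
        use strict;
        use Encode qw(decode);

        # A byte string is valid UTF-8 iff strict decoding doesn't die.
        sub looks_like_utf8 {
            my ($bytes) = @_;
            return eval {
                decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
                1;
            } ? 1 : 0;
        }

        print looks_like_utf8("\xC3\x9C") ? "valid\n" : "invalid\n";  # valid ("Ü")
        print looks_like_utf8("\xDC")     ? "valid\n" : "invalid\n";  # invalid on its own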

        I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters

        Note that if you have a file that is originally ASCII and you add non-ASCII characters to it, it's up to the editor to choose which encoding it will use when saving the file. Many editors will default to UTF-8, but some may not!

        with pre tags

        You may have noticed that when using <pre> tags, you have to escape square brackets, [ is &#91; and ] is &#93;.

        Update: Improved wording of first paragraph.

Re: create clone script for utf8 encoding
by Anonymous Monk on Dec 15, 2018 at 08:17 UTC
    I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

    Strictly speaking, that depends both on the program that created the file and your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file full of bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), translate to a greeting in Russian.

    It just so happens that when you type text on your English keyboard and it's encoded into bytes according to the rules defined by your locale, the resulting UTF-8-encoded bytes (Ubuntu has been UTF-8 by default for years) have the same meaning if you decode them as ASCII. UTF-8 was designed to be "backwards compatible" with ASCII for the first 128 code points.
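
    Here's a small Perl sketch of the same point - one string of bytes, two readings (the Latin-1 line just shows the mojibake a wrong decoder produces):

    use warnings;
    use strict;
    use Encode qw(decode);

    my $bytes = "\xF0\xD2\xC9\xD7\xC5\xD4";     # the six bytes from above
    binmode STDOUT, ':encoding(UTF-8)';
    print decode('KOI8-R',     $bytes), "\n";   # Привет ("Hello")
    print decode('ISO-8859-1', $bytes), "\n";   # ðÒÉ×ÅÔ (nonsense)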

    $ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
    iconv: illegal input sequence at position 0
    Does ascii have a representation for Ü?

    No. If you consult the ASCII table, you will see that it only defines glyphs corresponding to byte values 0..127. With 26*2 letters + 10 digits + 32 control characters to be interpreted by teletypes (or terminal emulators) there is only enough space for some punctuation marks, but no accented characters. Single-byte encodings like ISO-8859-1 or KOI8-R use the byte values 128..255 for that.
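
    A quick Perl sketch of the size difference (%vX prints each byte of the result in hex):

    use warnings;
    use strict;
    use Encode qw(encode);

    # "Ü" is U+00DC: one byte in a single-byte encoding like ISO-8859-1,
    # two bytes in UTF-8, and not representable at all in 7-bit ASCII.
    my $u = "\x{00DC}";
    printf "Latin-1: %vX\n", encode('ISO-8859-1', $u);   # DC
    printf "UTF-8:   %vX\n", encode('UTF-8',      $u);   # C3.9C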

    If you run file 2.ascii.de.txt, you will see that it's actually UTF-8. file can also discern pure ASCII files - because they don't have any bytes above 127 - but cannot discern different single-byte non-ASCII encodings. Those can contain any byte values, and you have to rely on statistics about the languages written in those encodings to guess - not 100% reliably - which language and which encoding it is. UTF-8 can also contain any byte values, but the bytes always follow specific rules which can be easily checked.

    Finally, what makes any of these en_**.utf8 encodings different from another?

    Those are locales, not encodings. The encoding specified by most of these locales is UTF-8, but the underlying locale settings, like number format (decimal dot or comma?), date-time format (Y-m-d or m/d/y? 12 hours or 24 hours?), string collation rules (yes, the way we sort strings depends on the language they are in), and so on, are different.

    Удачи ("good luck"),
      Strictly speaking, that depends both on the program that created the file and your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file full of bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), translate to a greeting in Russian.

      Спасибо, анонимный монах ("thank you, anonymous monk"). I try to run all the posted source on threads where I'm OP, and I was very pleased to run yours and have an iconv command that worked 100 percent. My many failed attempts earned partial credit along the way, which helped diagnose the problem. I sense that you are experienced with cyrillic encodings, so I'm very happy to have your attention to my issues, which must seem parochial by your standards.

      $ printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > 1.file
      $ iconv -f koi8-r 1.file
      Привет
      $ iconv -f koi8-r 1.file -o 1.prubyet
      $ file 1.prubyet
      1.prubyet: UTF-8 Unicode text
      $ cat 1.prubyet
      Привет
      $ cat 1.file
      ������
      $ file 1.file
      1.file: ISO-8859 text

      I know how these look in the terminal and in my editor. 3.file shows the cyrillic greeting. 1.file has six diamonds with question marks in the middle.

      I wondered what diff would think of them:

      $ echo Привет >3.file
      $ diff 1.file 3.file
      1c1
      < ������
      ---
      > Привет
      $

      I'm looking at 1.file and 3.file in the hex editor. 1.file was exactly what I expected, but 3.file has one value more than the 12 I expected. (?)

      D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 0A

      I'd hoped this renders faithfully with monastery code tags. Do I gather that code tags unravel things that aren't us-ascii? Has anyone ever suggested having a form of code tag that did not do this?

        use <pre> instead of <code> for unicode
        3.file has one value more than the 12 I expected. (?)
        The 0A at the end is the newline, "\n". If you omit it, the shell prompt will be printed on the same line as the text:
        username@localhost:~$ printf '\xf0\xd2\xc9\xd7\xc5\xd4' | iconv -f koi8-r
        Приветusername@localhost:~$
        Together with carriage return "\r", this can be used to produce various effects on the console. For example, the following program prints two different strings, but after it's finished the terminal will look like it didn't print anything:
        perl -e '$|=1; print "Now you see me!"; sleep 1; print "\r"; print "Now you don\x27t! "; sleep 1; printf "\r"'
        (Actually, you may see part of its output if your shell prompt is short enough. For a more honest but less portable version, see man console_codes.)
Re: create clone script for utf8 encoding
by kcott (Archbishop) on Dec 15, 2018 at 09:58 UTC

    G'day Aldebaran,

    "All of my scripts have a taxonomy of a positive integer followed by a period, followed by a word."

    I'd probably aim for something like this in Perl:

    $ perl -E 'my $x = "2.X"; my ($y, $z) = ($x =~ /^(\d+)(.*)$/); say $x; + say ++$y . $z' 2.X 3.X

    This allows leading zeros in your filenames to be retained when incremented:

    $ perl -E 'my $x = "02.X"; my ($y, $z) = ($x =~ /^(\d+)(.*)$/); say $x +; say ++$y . $z' 02.X 03.X

    Which you may want to consider for a future filename convention to facilitate sorting. Compare these:

    $ perl -E 'say for sort qw{1.X 9.X}'
    1.X
    9.X
    $ perl -E 'say for sort qw{1.X 9.X 10.X}'
    1.X
    10.X
    9.X
    $ perl -E 'say for sort qw{01.X 09.X 10.X}'
    01.X
    09.X
    10.X
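
    (If you'd rather not rename existing files, a numeric-aware sort is another option - a quick sketch:)

    $ perl -E 'say for sort { ($a =~ /^(\d+)/)[0] <=> ($b =~ /^(\d+)/)[0] } qw{1.X 9.X 10.X}'
    1.X
    9.X
    10.X
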
    "I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding."

    There's a variety of ways to do this. You don't say how the encoding is chosen: the following is just general information. See the open pragma and the open function; both contain links to additional information. See any perluni* links in http://perldoc.perl.org/perl.html.

    "Does ascii have a representation for Ü?"

    I see that AM has answered this. In Unicode, that character is:

    U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

    You might want to look at the Unicode Code Charts. That particular character is in (PDF link) "C1 Controls and Latin-1 Supplement". You may find it useful to familiarise yourself with the blocks of characters you deal with most often, perhaps even download a copy for quick reference; alternatively, if this is something you'll only want occasionally, individual internet searches will find the same information (searching for just "Unicode Ü" worked for me).

    "I'd show previous attempts, but they look awful."

    Showing what you've tried, even unsuccessful attempts, will often help us to better help you. We can see where your thought process (via code logic) was heading and perhaps steer you on a better course. A few minor tweaks may turn "awful" into "awesome". It can also help future readers who might be trying similar things.

    — Ken

      Showing what you've tried, even unsuccessful attempts, will often help us to better help you. We can see where your thought process (via code logic) was heading and perhaps steer you on a better course. A few minor tweaks may turn "awful" into "awesome". It can also help future readers who might be trying similar things.

      If you really want to see some broken code, I've got one more problem with regex for highest number; it's actually this related regex that is failing for me as it rounds the bend into double digit territory.

      I have a page like a garden-variety webpage, clearly not a finished product, and this sub won't give me a number beyond ten:

      sub highest_number {
          use strict;
          use File::Basename;
          use Cwd;
          my ($aref, $filetype, $word) = @_;
          my $number;
          my @matching;
          my $ext = "." . $filetype;
          push(@matching, 0);    #min returned value
          for my $file (@{$aref}) {
              #print "file is $file\n";
              if ($file =~ /^$word(\d*)$ext$/) {
                  #print "matching is $file\n";
                  push(@matching, $1);
              }
          }
          @matching = sort @matching;
          my $winner = pop @matching;
          return $winner
      }
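
      (Writing this out, I suspect the default string sort is what breaks down in double digits, since "10" sorts before "9" as text; a numeric comparison like this sketch would probably behave better:)

      # hypothetical fix: compare numerically, so "10" no longer sorts before "9"
      @matching = sort { $a <=> $b } @matching;
      my $winner = pop @matching;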

      Did I not promise awful? I don't even get how this code could run without a semi-colon after $winner.

      But I have made progress on the central task, and thank you for your response. Until I can find a meaningful name and place for it, I have called it 1.a.pl:

      $ history | tail - 10
      ==> standard input <==
      1987 pt 1.a.pl
      ...
      1990 ./1.a.pl 1.k.pl
      1991 rm 2.k.pl
      1992 ./1.a.pl 1.k.pl
      1993 file -i *.pl
      1994 cat 1.manifest
      ...
      1996 history | tail - 10
      tail: cannot open '10' for reading: No such file or directory
      $

      This is getting there:

      $ cat 1.a.pl
      #!/usr/bin/perl -w
      use 5.011;
      use Path::Tiny;
      use Encode;
      use utf8;    # a la François
      use open OUT => ':encoding(utf8)';
      use open ':std';

      # This script increments and clones the file in $1.

      ## enabling cyrillic
      ## decode argv and current
      say "argv is @ARGV";
      foreach (@ARGV) {
          say "before decode is $_";
          $_ = decode( 'UTF-8', $_ );
          say "after decode is $_";
      }
      my (@in_files) = @ARGV;
      my $current = Path::Tiny->cwd;
      $current = decode( 'UTF-8', $current );
      say "current is $current";
      say "-------------";
      say "in_file: @in_files";

      for (@in_files) {
          my $tiny_in = path($_);    ## use Path::Tiny
          my $file_contents = $tiny_in->slurp_utf8;
          $_ =~ m/^(\d+)(.*)$/;
          my $number    = $1;
          my $rest      = $2;
          my $increment = $number + 1;
          my $new_base  = $increment . $rest;
          say "new base is $new_base";

          ## use Path::Tiny to create new file
          my $save_file = path( $current, $new_base )->touchpath;
          say "save path is $save_file";
          my $return = $tiny_in->copy($save_file);
          $return->chmod(0755);
          say "return is $return";

          ## write to local manifest
          my $manifest_name = "1.manifest";
          path($manifest_name)->append_utf8( $new_base . "\n" );
          system "cat $manifest_name";
          system "cat $save_file";
      }
      $

      It looks like I got the spacing on the manifest file squared away:

      $ cat 1.manifest
      2.haukex.pl
      3.haukex.pl
      4.haukex.pl4.ping3a.pl5.haukex.pl5.ping3a.pl2.k.pl/n2.k.pl
      3.k.pl
      $
        "I don't even got how this code could run without a semi-colon after $winner ."

        It's the last line of "sub highest_number { ... }". Technically, that is terminated by the final brace: while this is valid code, I wouldn't recommend it (certainly not for production code). The problem is that if more code is added you'll have a statement that is neither terminated by a brace nor a semicolon.

        The same applies to arrays (and hashes) and commas. Consider this contrived example:

        my @numbers = ( 'one', 'three', 'two' );

        Oops! I'll just reorder those (a simple ddp sequence in vi):

        my @numbers = ( 'one', 'two' 'three', );

        Now you've got an even bigger "Oops!".

        Having said that, the biggest, single improvement you could make to this code would be the use of consistent indentation. It took me some time, tracking back and forth between opening and closing braces, to see where the related blocks of code were. This type of code is highly error-prone. Perhaps look at perlstyle; and perltidy may prove useful.

        — Ken

Re: create clone script for utf8 encoding
by dsheroh (Monsignor) on Dec 16, 2018 at 10:16 UTC
    Finally, what makes any of these en_**.utf8 encodings different from another?
    All of the en_**.utf8 locales use utf8 encoding, as indicated by the .utf8 ending. At that level, they are identical.

    Where they differ is in the instructions they provide to programs on how to interpret and display data which may vary depending on what country/culture you are in. For example, in an en_US locale, the currency symbol is $, while in an en_GB locale, it is £. Other things which may vary by locale include sort orders, date formats, decimal and thousands separators, address formats, the wording of system-produced messages, spelling ("color" (en_US) vs. "colour" (en_GB)), and so on.

    As you may now see, the locale setting is generally three distinct pieces of information packed into a single string. The en_US.utf8 locale tells programs to use the English language (en), and specifically the US dialect of English and US formatting conventions (_US), and to encode character data with the utf8 encoding scheme (.utf8).
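
    A little Perl sketch of that in action (assuming both locales are installed - they both appear in the locale -a output above; Russian uses a decimal comma):

    use warnings;
    use strict;
    use locale;
    use POSIX qw(setlocale LC_NUMERIC);

    # The same number, formatted under two UTF-8 locales:
    for my $loc ('en_US.utf8', 'ru_RU.utf8') {
        next unless defined setlocale(LC_NUMERIC, $loc);
        printf "%s: %.2f\n", $loc, 3.14;   # "3.14" vs "3,14"
    }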

Re: create clone script for utf8 encoding
by ikegami (Patriarch) on Dec 18, 2018 at 03:09 UTC

    Note that I said that Perl expects Perl code to be ASCII by default (and that it's tolerant of illegal bytes in string literals). No encoding is used for file handles (e.g. STDOUT) by default.
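
    (For what it's worth, you can see that by asking a handle for its PerlIO layers - no :encoding layer appears unless you add one; the exact layer names vary by platform and build:)

    $ perl -e 'print join(" ", PerlIO::get_layers(STDOUT)), "\n"'
    unix perlio
    $ perl -e 'binmode STDOUT, ":encoding(UTF-8)"; print join(" ", PerlIO::get_layers(STDOUT)), "\n"'
    unix perlio encoding(utf-8-strict) utf8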

Re: create clone script for utf8 encoding
by haukex (Archbishop) on Dec 25, 2018 at 14:40 UTC

    I thought you might be interested in this: I just released "enctool", which will guess and verify files' encodings. For example, to test whether a file which you know contains Cyrillic characters is encoded in UTF-8 or KOI8-R: enctool --encodings=UTF-8,KOI8-R --one-of='\p{Script=Cyrillic}' filename.txt (there are lots of other options too, see the POD - in this case, e.g. --test-all --list-chars --extra-verbose might also be interesting). Although there are tests, I rewrote it pretty much from scratch from an earlier version, so I've still labeled it beta - if there are issues, let me know.

    Update: If you work with KOI8-R a lot, you might want to change the default list of encodings, for example, one way is to put this in your ~/.profile: export ENCTOOL_ENCODINGS="ASCII,UTF-8,KOI8-R,Latin1,CP1252"

      I knew a person who could intuitively decipher Mojibake that resulted from mishandling single-byte encodings, like this:
                KOI8-R    CP1251    CP866     ← decoded as...
      KOI8-R    Привет    рТЙЧЕФ    Ё╥╔╫┼╘
      CP1251    оПХБЕР    Привет    ╧ЁштхЄ
      CP866     ▐Ю╗╒╔Б    ЏаЁўҐв    Привет
      ↑ encoded to...
      Thankfully, it's not often nowadays that we have to resort to this sort of frequency analysis.

        Interesting...I didn't know that Mojibake had a name but do have a sense of it. I think of it like being in the weeds.