Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Just as soon as ikegami made me aware that us-ascii is my default encoding, I seem to be developing problems with it. These problems are of the variety of my computer not behaving the way I expect it to.

I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

So this is a day in the life, where I use this nifty software: translate shell

$ trans :de -brief "over" >2.ascii.de.txt
$ trans :de -brief "He must." >>2.ascii.de.txt
$ cat 2.ascii.de.txt
Über
Er muss.
$ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
iconv: illegal input sequence at position 0
$

On STDOUT for me, I get Ü as the zeroth character. Does ascii have a representation for Ü?

Then I keep trying to get an iconv command to do something for me, but an effective syntax eludes me. Why is Ü illegal in the iconv command?

If I'm going to have source that has utf8 characters in it, doesn't it make sense to change the underlying encoding to utf8, or to create it that way from the get-go?

After I've touched a file into existence, I use a bash script to clone the next version of a script. All of my scripts have a taxonomy of a positive integer followed by a period, followed by a word. The cloned script is incremented, given execute privileges, and has its name written to a manifest. There isn't any language in it for determining the underlying encoding. I've gotten a lot of mileage out of this script, but I think it's time to replace it with shiny new, lexical perl. I'll put it in readmore tags for being somewhat OT:

$ cat 2.create.bash
#!/bin/bash
# which bash version?
echo "The shebang is specifying bash"
if [ -z "${BASH_VERSION}" ]; then
    echo "Not using bash but dash"
else
    echo "Using bash ${BASH_VERSION}"
fi
#get the first number from $1
#c=$(("$1" : '\([0-9]*\).*$')) didn't work
c=$(expr "$1" : '\([0-9]*\).*$')
echo $c
f=$1
#integer addition
d=$(expr $c + 1)
echo $d
#munge new file, no clobber
t="$d"
q=${f#*.}
s=$t.$q
echo $s
cp -n $f $s
chmod +x $s
echo $s >> 1.manifest
ls -lh $s
gedit $s &
$

I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding. I'd show previous attempts, but they look awful.

Finally, what makes any of these en_**.utf8 encodings different from another?

$ locale charmap
UTF-8
$ locale -a
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
ru_RU.utf8
ru_UA.utf8
$ locale -m
ANSI_X3.110-1983
ANSI_X3.4-1968
...
UTF-8
VIDEOTEX-SUPPL
VISCII
WIN-SAMI-2
WINDOWS-31J
$

Thanks for your comment

Re: create clone script for utf8 encoding
by haukex (Archbishop) on Dec 15, 2018 at 13:29 UTC

    You've already gotten some good answers, I just wanted to pick up on a couple more points. In general, you might want to have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), as well as perluniintro and perlunicode.

    when I touch a file into existence, it is us-ascii

    touch creates an empty file, and it doesn't have any encoding - a file's encoding is not some kind of metadata attribute secretly attached to a file. A file just contains bytes, and it is up to the reading and writing programs to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here). Many software packages have defaults for such encodings, such as Latin-1, CP1252, UTF-8, or UTF-16, but the software packages often don't agree. And of those four examples, only the latter two allow you to encode all valid Unicode code points. As for ASCII, it is a subset of many different character encodings (such as Latin-1, CP1252, and UTF-8), and it only covers bytes with values 0 to 127 (the lower 7 bits of the byte), so it encodes even fewer characters:

    For example, the Euro symbol € ("\x{20AC}" or "\N{U+20AC}" in Perl) is:

    not representable at all in ASCII or Latin-1
    the single byte 80 in CP1252
    the three bytes E2 82 AC in UTF-8
    the two bytes 20 AC in UTF-16BE (or AC 20 in UTF-16LE)

    (Copied from my post here.) I wrote some more about the whole topic here.

    files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform.

    As the AM post explained, that's unlikely here, since "Ü" is not representable in ASCII (which is also what iconv is telling you with its error) - most likely it's UTF-8. You can check by piping the output to e.g. hexdump: on my terminal, echo -n "€" | hexdump -C shows the bytes e2 82 ac, and as I showed above, that's the UTF-8 encoding. If you're really unsure of a file's encoding, there's Encode::Guess (I showed an example here), keeping in mind that it's just guessing.
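
    Here's a minimal sketch of that kind of guessing (the file name and the list of extra suspects are just examples):

    use warnings;
    use strict;
    use Encode::Guess;

    # Read the raw bytes of the file whose encoding we want to guess.
    my $file = "2.ascii.de.txt";
    open my $fh, '<:raw', $file or die "$file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    # guess_encoding() tries ASCII and UTF-8 plus any extra suspects we
    # name; it returns an encoding object on success, or an error string
    # when it fails or the result is ambiguous.
    my $guess = guess_encoding($bytes, qw/koi8-r latin1/);
    print ref $guess ? "looks like " . $guess->name . "\n"
                     : "no clear guess: $guess\n";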

    I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding.

    If you're talking about the Perl source code itself, IMHO really the only two useful choices are plain ASCII, or UTF-8, and in the latter case, you have to tell Perl by adding use utf8; at the top of your file (see utf8). If your Perl source code is in ASCII, you can still represent Unicode characters in strings and regexes using escapes like "\x{...}" and "\N{...}" (see also charnames). And since ASCII is a subset of UTF-8, if you stick to those two encodings for your Perl source, your "clone" script doesn't have anything to worry about, it can just cp the files, all you need to do is add the use utf8; when appropriate. Just make sure that whatever editor you're using to work on your Perl scripts uses UTF-8 when it saves the files.
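
    For instance, here's a small self-contained sketch: the source file is pure ASCII, yet it prints non-ASCII characters (the binmode line is just one way to set the output encoding; the open pragma mentioned in the next paragraph works too):

    use warnings;
    use strict;
    use charnames ':full';   # enables "\N{CHARACTER NAME}" escapes

    my $u_uml = "\x{00DC}";                 # Ü, by code point
    my $euro  = "\N{EURO SIGN}";            # €, by name
    binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output
    print "$u_uml $euro\n";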

    If you're talking about files that your Perl program is reading and writing, you'd specify those encodings with the three-argument open (which I'd recommend), with binmode, or set defaults with the open pragma (the latter is useful for changing the encoding of the STDIN/OUT/ERR streams as well). For en-/decoding strings of bytes you've already got in Perl, there's the Encode family of modules, plus for UTF-8, utf8::encode() and utf8::decode(). There's also the -C command-line switch (which I'd mostly only use for oneliners) and the PERLIO environment variable (which I've almost never had a need for), see perlrun.
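
    And a tiny sketch of the Encode interface (the two bytes here are what UTF-8 uses for "Ü"):

    use warnings;
    use strict;
    use Encode qw(encode decode);

    my $bytes = "\xC3\x9C";               # two bytes: the UTF-8 encoding of "Ü"
    my $chars = decode('UTF-8', $bytes);  # now one character, U+00DC
    printf "U+%04X\n", ord $chars;        # prints "U+00DC"
    my $again = encode('UTF-8', $chars);  # back to the bytes C3 9C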

    BTW, you can do the same thing as iconv with Perl:

    use warnings;
    use strict;
    # iconv -f UTF-8 utf8.txt -t Latin9 -o latin9.txt
    my ($ifile, $ienc) = ("utf8.txt",   "UTF-8");
    my ($ofile, $oenc) = ("latin9.txt", "Latin9");
    open my $ifh, "<:raw:encoding($ienc)", $ifile or die "$ifile: $!";
    open my $ofh, ">:raw:encoding($oenc)", $ofile or die "$ofile: $!";
    print $ofh do { local $/; <$ifh> };
    close $ifh;
    close $ofh;

      a file's encoding is not some kind of metadata attribute secretly attached to a file. A file just contains bytes, and it is up to the reading and writing programs to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here).

      Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

      $ file -i *.pl
      18.clone.pl: text/x-perl; charset=us-ascii
      1.a.pl: text/x-perl; charset=utf-8
      1.haukex.pl: text/x-perl; charset=us-ascii
      1.k.pl: text/x-perl; charset=us-ascii
      2.haukex.pl: text/x-perl; charset=us-ascii
      3.haukex.pl: text/x-perl; charset=utf-8
      3.ping3a.pl: text/x-perl; charset=us-ascii
      4.haukex.pl: text/x-perl; charset=utf-8
      4.ping3a.pl: text/x-perl; charset=us-ascii
      5.ping3a.pl: text/x-perl; charset=us-ascii
      $

      What seems to be very much the case is that the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it. I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters, like so with pre tags:

      $ ./1.a.pl 3.haukex.pl
      argv is 3.haukex.pl
      before decode is 3.haukex.pl
      after decode is 3.haukex.pl
      current is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw
      -------------
      in_file: 3.haukex.pl
      new base is 4.haukex.pl
      save path is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
      return is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
      2.haukex.pl
      3.haukex.pl
      4.haukex.pl
      #!/usr/bin/perl -w
      use 5.011;
      use Carp;
      use Data::Alias 'alias';
      use Data::Dumper;
      use utf8;   # a la François
      use open OUT => ':encoding(utf8)';
      use open ':std';
      
      sub rangeparse {
      	local $_ = shift;
      	my @o;  #  row1,col1, row2,col2  (-1 = last row/col)
      	if (@o=/\AR([0-9]+|n)C([0-9]+|n):R([0-9]+|n)C([0-9]+|n)\z/) {}
      	elsif (/\AR([0-9]+|n):R([0-9]+|n)\z/) { @o=($1,1,$2,-1) }
      	elsif (/\AC([0-9]+|n):C([0-9]+|n)\z/) { @o=(1,$1,-1,$2) }
      	elsif (/\AR([0-9]+|n)C([0-9]+|n)\z/) { @o=($1,$2,$1,$2) }
      	elsif (/\AR([0-9]+|n)\z/) { @o=($1,1,$1,-1) }
      	elsif (/\AC([0-9]+|n)\z/) { @o=(1,$1,-1,$1) }
      	else { croak "failed to parse '$_'" }
      	$_ eq 'n' and $_=-1 for @o;
      	return \@o;
      }
      
      
      
      use Test::More tests=>2;
      
      
      
      is_deeply rangeparse("RnC2:RnC5"),   -1, 2, -1, 5 ;
      is_deeply rangeparse("R3C2:RnCn"),    3, 2, -1,-1 ;
      
      my $data = ['й', ' ', ' ', 'л', ' ', ' ', 'с', ' ', ' ', 1..9];
      
      say Dumper $data; 
      
      
      sub getsubset {
      	my ($data,$range) = @_;
      	my $cols = @{$$data[0]};
      	@$_==$cols or croak "data not rectangular" for @$data;
      	$range = rangeparse($range) unless ref $range eq 'ARRAY';
      	@$range==4 or croak "bad size of range";
      	my @max = (0+@$data,$cols)x2;
      	for my $i (0..3) {
      		$$range[$i]=$max[$i] if $$range[$i]<0;
      		croak "index $i out of range"
      			if $$range[$i]<1 || $$range[$i]>$max[$i];
      	}
      	croak "bad rows $$range[0]-$$range[2]" if $$range[0]>$$range[2];
      	croak "bad cols $$range[1]-$$range[3]" if $$range[1]>$$range[3];
      	my @cis = $$range[1]-1 .. $$range[3]-1;
      	return [ map { sub{\@_}->(@{$$data[$_]}[@cis]) }
      		$$range[0]-1 .. $$range[2]-1 ]
      }
      
      

      This is a trimmed down version of haukex's result in Selecting Ranges of 2-Dimensional Data. I'm populating it with cyrillic values and hope to run some tests, but I still want to get this clone tool squared away. Still working through other parts of your post....

        Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

        Yes, with the emphasis being that it's just a guess.

        the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it.

        Yes, and it might be important to note that there are certain sequences of bytes that are not valid UTF-8 (see e.g. UTF-8), which means that in some cases, it's possible to differentiate between random bytes and valid UTF-8 text. Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.
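
        As a rough sketch of how to put that fact to use in Perl (strict decoding croaks on malformed input, so a successful eval means the bytes were valid UTF-8):

        use warnings;
        use strict;
        use Encode qw(decode);

        # A byte string is valid UTF-8 iff strict decoding doesn't die.
        sub looks_like_utf8 {
            my ($bytes) = @_;
            return eval {
                decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
                1;
            } ? 1 : 0;
        }

        print looks_like_utf8("\xC3\x9C") ? "valid\n" : "invalid\n";  # valid ("Ü")
        print looks_like_utf8("\xDC")     ? "valid\n" : "invalid\n";  # invalid on its own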

        I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters

        Note that if you have a file that is originally ASCII and you add non-ASCII characters to it, it's up to the editor to choose which encoding it will use when saving the file. Many editors will default to UTF-8, but some may not!

        with pre tags

        You may have noticed that when using <pre> tags, you have to escape square brackets, [ is &#91; and ] is &#93;.

        Update: Improved wording of first paragraph.

Re: create clone script for utf8 encoding
by Anonymous Monk on Dec 15, 2018 at 08:17 UTC
    I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

    Strictly speaking, that depends both on the program that created the file and your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file full of bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), translate to a greeting in Russian.

    It just so happens that when you type text on your English keyboard and it's encoded into bytes according to the rules defined by your locale, the resulting UTF-8-encoded bytes (Ubuntu has been UTF-8 by default for years) have the same meaning if you decode them as ASCII. UTF-8 was designed to be "backwards compatible" with ASCII for the first 128 code points.
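
    Here's a small Perl sketch of the same point - one string of bytes, two readings (the Latin-1 line just shows the mojibake a wrong decoder produces):

    use warnings;
    use strict;
    use Encode qw(decode);

    my $bytes = "\xF0\xD2\xC9\xD7\xC5\xD4";     # the six bytes from above
    binmode STDOUT, ':encoding(UTF-8)';
    print decode('KOI8-R',     $bytes), "\n";   # Привет ("Hello")
    print decode('ISO-8859-1', $bytes), "\n";   # ðÒÉ×ÅÔ (nonsense)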

    $ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
    iconv: illegal input sequence at position 0
    Does ascii have a representation for Ü?

    No. If you consult the ASCII table, you will see that it only defines glyphs corresponding to byte values 0..127. With 26*2 letters + 10 digits + 32 control characters to be interpreted by teletypes (or terminal emulators) there is only enough space for some punctuation marks, but no accented characters. Single-byte encodings like ISO-8859-1 or KOI8-R use the byte values 128..255 for that.
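
    A quick Perl sketch of the size difference (%vX prints each byte of the result in hex):

    use warnings;
    use strict;
    use Encode qw(encode);

    # "Ü" is U+00DC: one byte in a single-byte encoding like ISO-8859-1,
    # two bytes in UTF-8, and not representable at all in 7-bit ASCII.
    my $u = "\x{00DC}";
    printf "Latin-1: %vX\n", encode('ISO-8859-1', $u);   # DC
    printf "UTF-8:   %vX\n", encode('UTF-8',      $u);   # C3.9C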

    If you run file 2.ascii.de.txt, you will see that it's actually UTF-8. file can also discern pure ASCII files - because they don't have any bytes above 127 - but cannot discern different single-byte non-ASCII encodings. Those can contain any byte values, and you have to rely on statistics about the languages written in those encodings to guess - not 100% reliably - which language and which encoding it is. UTF-8 can also contain any byte values, but the bytes always follow specific rules which can be easily checked.

    Finally, what makes any of these en_**.utf8 encodings different from another?

    Those are locales, not encodings. The encoding specified by most of these locales is UTF-8, but the underlying locale settings, like number format (decimal dot or comma?), date-time format (Y-m-d or m/d/y? 12 hours or 24 hours?), string collation rules (yes, the way we sort strings depends on the language they are in), and so on, are different.

    Удачи ("good luck"),
      Strictly speaking, that depends both on the program that created the file and your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file full of bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), translate to a greeting in Russian.

      Спасибо, анонимный монах ("thank you, anonymous monk"). I try to run all the posted source on threads where I'm OP, and I was very pleased to run yours and have an iconv command that worked 100 percent. My many failed attempts earned partial credit along the way, which helped diagnose the problem. I sense that you are experienced with cyrillic encodings, so I'm very happy to have your attention to my issues, which must seem parochial by your standards.

      $ printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > 1.file
      $ iconv -f koi8-r 1.file
      Привет
      $ iconv -f koi8-r 1.file -o 1.prubyet
      $ file 1.prubyet
      1.prubyet: UTF-8 Unicode text
      $ cat 1.prubyet
      Привет
      $ cat 1.file
      ������
      $ file 1.file
      1.file: ISO-8859 text

      I know how these look in the terminal and in my editor. 3.file shows the cyrillic greeting. 1.file has six diamonds with question marks in the middle.

      I wondered what diff would think of them:

      $ echo Привет >3.file
      $ diff 1.file 3.file
      1c1
      < ������
      ---
      > Привет
      $

      I'm looking at 1.file and 3.file in the hex editor. 1.file was exactly what I expected, but 3.file has one value more than the 12 I expected. (?)

      D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 0A

      I'd hoped this renders faithfully with monastery code tags. Do I gather that code tags unravel things that aren't us-ascii? Has anyone ever suggested having a form of code tag that did not do this?

        use <pre> instead of <code> for unicode
        3.file has one value more than the 12 I expected. (?)
        The 0A at the end is the newline, "\n". If you omit it, the shell prompt will be printed on the same line as the text:
        username@localhost:~$ printf '\xf0\xd2\xc9\xd7\xc5\xd4' | iconv -f koi8-r
        Приветusername@localhost:~$
        Together with carriage return "\r", this can be used to produce various effects on the console. For example, the following program prints two different strings, but after it's finished the terminal will look like it didn't print anything:
        perl -e '$|=1; print "Now you see me!"; sleep 1; print "\r"; print "Now you don\x27t! "; sleep 1; printf "\r"'
        (Actually, you may see part of its output if your shell prompt is short enough. For a more honest but less portable version, see man console_codes.)
Re: create clone script for utf8 encoding
by kcott (Archbishop) on Dec 15, 2018 at 09:58 UTC

    G'day Aldebaran,

    "All of my scripts have a taxonomy of a positive integer followed by a period, followed by a word."

    I'd probably aim for something like this in Perl:

    $ perl -E 'my $x = "2.X"; my ($y, $z) = ($x =~ /^(\d+)(.*)$/); say $x; + say ++$y . $z' 2.X 3.X

    This allows leading zeros in your filenames to be retained when incremented:

    $ perl -E 'my $x = "02.X"; my ($y, $z) = ($x =~ /^(\d+)(.*)$/); say $x +; say ++$y . $z' 02.X 03.X

    Which you may want to consider for a future filename convention to facilitate sorting. Compare these:

    $ perl -E 'say for sort qw{1.X 9.X}'
    1.X
    9.X
    $ perl -E 'say for sort qw{1.X 9.X 10.X}'
    1.X
    10.X
    9.X
    $ perl -E 'say for sort qw{01.X 09.X 10.X}'
    01.X
    09.X
    10.X
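
    (If you'd rather not rename existing files, a numeric-aware sort is another option - a quick sketch:)

    $ perl -E 'say for sort { ($a =~ /^(\d+)/)[0] <=> ($b =~ /^(\d+)/)[0] } qw{1.X 9.X 10.X}'
    1.X
    9.X
    10.X
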
    "I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding."

    There's a variety of ways to do this. You don't say how the encoding is chosen: the following is just general information. See the open pragma and the open function; both contain links to additional information. See any perluni* links in http://perldoc.perl.org/perl.html.

    "Does ascii have a representation for Ü?"

    I see that AM has answered this. In Unicode, that character is:

    U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

    You might want to look at the Unicode Code Charts. That particular character is in (PDF link) "C1 Controls and Latin-1 Supplement". You may find it useful to familiarise yourself with the blocks of characters you deal with most often, perhaps even download a copy for quick reference; alternatively, if this is something you'll only want occasionally, individual internet searches will find the same information (searching for just "Unicode Ü" worked for me).

    "I'd show previous attempts, but they look awful."

    Showing what you've tried, even unsuccessful attempts, will often help us to better help you. We can see where your thought process (via code logic) was heading and perhaps steer you on a better course. A few minor tweaks may turn "awful" into "awesome". It can also help future readers who might be trying similar things.

    — Ken

      Showing what you've tried, even unsuccessful attempts, will often help us to better help you. We can see where your thought process (via code logic) was heading and perhaps steer you on a better course. A few minor tweaks may turn "awful" into "awesome". It can also help future readers who might be trying similar things.

      If you really want to see some broken code, I've got one more problem with regex for highest number; it's actually this related regex that is failing for me as it rounds the bend into double digit territory.

      I have a page like a garden-variety webpage, clearly not a finished product, and this sub won't give me a number beyond ten:

      sub highest_number {
          use strict;
          use File::Basename;
          use Cwd;
          my ($aref, $filetype, $word) = @_;
          my $number;
          my @matching;
          my $ext = "." . $filetype;
          push(@matching, 0);    #min returned value
          for my $file (@{$aref}) {
              #print "file is $file\n";
              if ($file =~ /^$word(\d*)$ext$/) {
                  #print "matching is $file\n";
                  push(@matching, $1);
              }
          }
          @matching = sort @matching;
          my $winner = pop @matching;
          return $winner
      }
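
      (Writing this out, I suspect the default string sort is what breaks down in double digits, since "10" sorts before "9" as text; a numeric comparison like this sketch would probably behave better:)

      # hypothetical fix: compare numerically, so "10" no longer sorts before "9"
      @matching = sort { $a <=> $b } @matching;
      my $winner = pop @matching;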

      Did I not promise awful? I don't even get how this code could run without a semi-colon after $winner.

      But I have made progress on the central task, and thank you for your response. Until I can find a meaningful name and place for it, I have called it 1.a.pl:

      $ history | tail - 10
      ==> standard input <==
      1987 pt 1.a.pl
      ...
      1990 ./1.a.pl 1.k.pl
      1991 rm 2.k.pl
      1992 ./1.a.pl 1.k.pl
      1993 file -i *.pl
      1994 cat 1.manifest
      ...
      1996 history | tail - 10
      tail: cannot open '10' for reading: No such file or directory
      $

      This is getting there:

      $ cat 1.a.pl
      #!/usr/bin/perl -w
      use 5.011;
      use Path::Tiny;
      use Encode;
      use utf8;    # a la François
      use open OUT => ':encoding(utf8)';
      use open ':std';

      # This script increments and clones the file in $1.

      ## enabling cyrillic
      ## decode argv and current
      say "argv is @ARGV";
      foreach (@ARGV) {
          say "before decode is $_";
          $_ = decode( 'UTF-8', $_ );
          say "after decode is $_";
      }
      my (@in_files) = @ARGV;
      my $current = Path::Tiny->cwd;
      $current = decode( 'UTF-8', $current );
      say "current is $current";
      say "-------------";
      say "in_file: @in_files";

      for (@in_files) {
          my $tiny_in = path($_);    ## use Path::Tiny
          my $file_contents = $tiny_in->slurp_utf8;
          $_ =~ m/^(\d+)(.*)$/;
          my $number    = $1;
          my $rest      = $2;
          my $increment = $number + 1;
          my $new_base  = $increment . $rest;
          say "new base is $new_base";

          ## use Path::Tiny to create new file
          my $save_file = path( $current, $new_base )->touchpath;
          say "save path is $save_file";
          my $return = $tiny_in->copy($save_file);
          $return->chmod(0755);
          say "return is $return";

          ## write to local manifest
          my $manifest_name = "1.manifest";
          path($manifest_name)->append_utf8( $new_base . "\n" );
          system "cat $manifest_name";
          system "cat $save_file";
      }
      $

      It looks like I got the spacing on the manifest file squared away:

      $ cat 1.manifest
      2.haukex.pl
      3.haukex.pl
      4.haukex.pl4.ping3a.pl5.haukex.pl5.ping3a.pl2.k.pl/n2.k.pl
      3.k.pl
      $
        "I don't even got how this code could run without a semi-colon after $winner ."

        It's the last line of "sub highest_number { ... }". Technically, that is terminated by the final brace: while this is valid code, I wouldn't recommend it (certainly not for production code). The problem is that if more code is added you'll have a statement that is neither terminated by a brace nor a semicolon.

        The same applies to arrays (and hashes) and commas. Consider this contrived example:

        my @numbers = ( 'one', 'three', 'two' );

        Oops! I'll just reorder those (a simple ddp sequence in vi):

        my @numbers = ( 'one', 'two' 'three', );

        Now you've got an even bigger "Oops!".

        Having said that, the biggest, single improvement you could make to this code would be the use of consistent indentation. It took me some time, tracking back and forth between opening and closing braces, to see where the related blocks of code were. This type of code is highly error-prone. Perhaps look at perlstyle; and perltidy may prove useful.

        — Ken

Re: create clone script for utf8 encoding
by dsheroh (Monsignor) on Dec 16, 2018 at 10:16 UTC
    Finally, what makes any of these en_**.utf8 encodings different from another?
    All of the en_**.utf8 locales use utf8 encoding, as indicated by the .utf8 ending. At that level, they are identical.

    Where they differ is in the instructions they provide to programs on how to interpret and display data which may vary depending on what country/culture you are in. For example, in an en_US locale, the currency symbol is $, while in an en_GB locale, it is £. Other things which may vary by locale include sort orders, date formats, decimal and thousands separators, address formats, the wording of system-produced messages, spelling ("color" (en_US) vs. "colour" (en_GB)), and so on.

    As you may now see, the locale setting is generally three distinct pieces of information packed into a single string. The en_US.utf8 locale tells programs to use the English language (en), and specifically the US dialect of English and US formatting conventions (_US), and to encode character data with the utf8 encoding scheme (.utf8).
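
    A little Perl sketch of that in action (assuming both locales are installed - they both appear in the locale -a output above; Russian uses a decimal comma):

    use warnings;
    use strict;
    use locale;
    use POSIX qw(setlocale LC_NUMERIC);

    # The same number, formatted under two UTF-8 locales:
    for my $loc ('en_US.utf8', 'ru_RU.utf8') {
        next unless defined setlocale(LC_NUMERIC, $loc);
        printf "%s: %.2f\n", $loc, 3.14;   # "3.14" vs "3,14"
    }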

Re: create clone script for utf8 encoding
by ikegami (Patriarch) on Dec 18, 2018 at 03:09 UTC

    Note that I said that Perl expects Perl code to be ASCII by default (and that it's tolerant of illegal bytes in string literals). No encoding is used for file handles (e.g. STDOUT) by default.
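
    (For what it's worth, you can see that by asking a handle for its PerlIO layers - no :encoding layer appears unless you add one; the exact layer names vary by platform and build:)

    $ perl -e 'print join(" ", PerlIO::get_layers(STDOUT)), "\n"'
    unix perlio
    $ perl -e 'binmode STDOUT, ":encoding(UTF-8)"; print join(" ", PerlIO::get_layers(STDOUT)), "\n"'
    unix perlio encoding(utf-8-strict) utf8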

Re: create clone script for utf8 encoding
by haukex (Archbishop) on Dec 25, 2018 at 14:40 UTC

    I thought you might be interested in this: I just released "enctool", which will guess and verify files' encodings. For example, to test whether a file which you know contains Cyrillic characters is encoded in UTF-8 or KOI8-R: enctool --encodings=UTF-8,KOI8-R --one-of='\p{Script=Cyrillic}' filename.txt (there are lots of other options too, see the POD - in this case, e.g. --test-all --list-chars --extra-verbose might also be interesting). Although there are tests, I rewrote it pretty much from scratch from an earlier version, so I've still labeled it beta - if there are issues, let me know.

    Update: If you work with KOI8-R a lot, you might want to change the default list of encodings, for example, one way is to put this in your ~/.profile: export ENCTOOL_ENCODINGS="ASCII,UTF-8,KOI8-R,Latin1,CP1252"

      I knew a person who could intuitively decipher Mojibake that resulted from mishandling single-byte encodings, like this:
                KOI8-R    CP1251    CP866     ← decoded as...
      KOI8-R    Привет    рТЙЧЕФ    Ё╥╔╫┼╘
      CP1251    оПХБЕР    Привет    ╧ЁштхЄ
      CP866     ▐Ю╗╒╔Б    ЏаЁўҐв    Привет
      ↑ encoded to...
      Thankfully, it's not often nowadays that we have to resort to this sort of frequency analysis.

        Interesting...I didn't know that Mojibake had a name but do have a sense of it. I think of it like being in the weeds.