Re^2: create clone script for utf8 encoding

Strictly speaking, that depends both on the program that created the file and your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file filled with bytes, that, when interpreted as KOI8-R (iconv -f koi8-r file), would translate to a greeting in Russian.

Спасибо, анонимный монах (thank you, Anonymous Monk). I try to run all the source posted on threads where I'm the OP, and I was very pleased to run yours and have an iconv command that worked 100 percent. The command gave me a lot of partial credit for failed attempts, which helped me diagnose things along the way. I sense that you are experienced with Cyrillic encodings, so I'm very happy to have your attention to my issues, which must seem parochial by your standards.

$ printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > 1.file 
$ iconv -f koi8-r 1.file
&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;
$ iconv -f koi8-r 1.file -o 1.prubyet
$ file 1.prubyet 
1.prubyet: UTF-8 Unicode text
$ cat 1.prubyet 
&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;
$ cat 1.file
&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;
$ file 1.file
1.file: ISO-8859 text

I know how these look in the terminal and in my editor. 1.prubyet shows the Cyrillic greeting. 1.file has six diamonds with question marks in the middle.

I wondered what diff would think of them:

$ echo &#1055;&#1088;&#1080;&#1074;&#1077;&#1090; >3.file
$ diff 1.file 3.file
1c1
< &#65533;&#65533;&#65533;&#65533;&#65533;&#65533;
---
> &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;
$

I'm looking at 1.file and 3.file in the hex editor. 1.file was exactly what I expected, but 3.file has one value more than the 12 I expected. (?)

D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 0A
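
The byte counts can also be checked without a hex editor. What follows is only a minimal sketch, assuming a POSIX-like shell with `mktemp`, `wc` and `od`; the names `koi8.bin` and `utf8.bin` are hypothetical scratch files, and octal escapes are used because not every `printf` understands `\x`:

```shell
# Work in a scratch directory so no existing file is overwritten.
tmp=$(mktemp -d)

# The greeting plus newline as KOI8-R: one byte per letter (6 + 1 = 7 bytes).
printf '\360\322\311\327\305\324\n' > "$tmp/koi8.bin"

# The same text as UTF-8: two bytes per Cyrillic letter (12 + 1 = 13 bytes).
printf '\320\237\321\200\320\270\320\262\320\265\321\202\n' > "$tmp/utf8.bin"

# Arithmetic expansion strips the padding some wc implementations add.
koi8_bytes=$(( $(wc -c < "$tmp/koi8.bin") ))
utf8_bytes=$(( $(wc -c < "$tmp/utf8.bin") ))
echo "KOI8-R: $koi8_bytes bytes, UTF-8: $utf8_bytes bytes"

# Hex dump of the UTF-8 file; the final 0a is the newline.
od -An -tx1 "$tmp/utf8.bin"
```

The UTF-8 file holds twelve bytes of text and one trailing byte, which matches the thirteen values in the dump above.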

I'd hoped this renders faithfully with monastery code tags. Do I gather that code tags unravel things that aren't us-ascii? Has anyone ever suggested having a form of code tag that did not do this?

Re^3: create clone script for utf8 encoding by Anonymous Monk on Dec 16, 2018 at 22:43 UTC
use `<pre>` instead of `<code>` for unicode
Re^4: create clone script for utf8 encoding by Aldebaran (Curate) on Dec 17, 2018 at 23:42 UTC
use pre instead of code for unicode

Sure enough...this is a repeat of the diff command with pre tags:

$ echo Привет >3.file
$ diff 1.file 3.file
1c1
< ��
---
> Привет
$

Hmmm, well there it is. I tried pre tags in the writeup but must not have pasted it in and previewed correctly.

There is something to learn from seeing the numerical representations of these characters. Indeed, I was surprised that 65533 * 6 was what diff thought 1.file was. It is the unicode replacement character: U+FFFD. Further reading and clarification here: unicode specials

How did you get single code and pre tags to display (surrounded by <>) and not foul the legibility?

Also, is there a way to employ the diff command so that the equality in these files could be established? (not essential or vital to this coding task)
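
On the last question, one possible approach is to bring both files into the same encoding before comparing them. This is a sketch only, assuming an `iconv` that knows KOI8-R; it writes to hypothetical scratch files instead of touching 1.file and 3.file:

```shell
tmp=$(mktemp -d)
# Same bytes as 1.file (KOI8-R) and 3.file (UTF-8), written with octal escapes.
printf '\360\322\311\327\305\324\n' > "$tmp/koi8.txt"
printf '\320\237\321\200\320\270\320\262\320\265\321\202\n' > "$tmp/utf8.txt"

# Convert the KOI8-R file to UTF-8 on the fly and let diff compare the result.
# "-" makes diff read its first input from stdin; -t is given explicitly so
# the outcome does not depend on the current locale.
if iconv -f KOI8-R -t UTF-8 "$tmp/koi8.txt" | diff - "$tmp/utf8.txt" > /dev/null
then
    same=yes
else
    same=no
fi
echo "same text: $same"
```

With the real files the equivalent would be `iconv -f koi8-r -t utf-8 1.file | diff - 3.file`, which prints nothing when the two texts are equal.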
Re^5: create clone script for utf8 encoding by Anonymous Monk on Dec 18, 2018 at 08:52 UTC
I tried pre tags in the writeup but must not have pasted it in and previewed correctly.

PerlMonks engine automatically replaces all symbols not representable in ASCII by their HTML entity codes: `ы` → `&#1099;`. The `<code>` tags are special non-HTML tags that don't allow HTML entities inside them to be interpreted, but the transformation still takes place.

(How did I write that? `<tt>ы</tt> &rightarrow; <c>ы</c>` and let PerlMonks make the replacement, knowing that the entity code inside `<tt>...</tt>` will be interpreted back into `ы`, while the one inside `<c>...</c>` won't. How did I write what I just wrote? Lots of `&lt;`s and `<code>` = `<c>` equivalence.)

It is the unicode replacement character: U+FFFD.

The replacement character is what happens when your terminal emulator tries to decode KOI8-R-encoded bytes as UTF-8 and fails. The actual output of diff contains both KOI8-R- and UTF-8-encoded bytes and can be decoded as KOI8-R:

$ diff 1.file 3.file | iconv -f koi8-r
1c1
< Привет
---
> п÷я─п╦п╡п╣я┌
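
The decoding failure can be reproduced without a terminal emulator. A minimal sketch, assuming an `iconv` that exits with a non-zero status on invalid input (GNU iconv does); `is_utf8` and the scratch file names are hypothetical:

```shell
tmp=$(mktemp -d)
printf '\360\322\311\327\305\324\n' > "$tmp/koi8.txt"                           # KOI8-R bytes
printf '\320\237\321\200\320\270\320\262\320\265\321\202\n' > "$tmp/utf8.txt"  # UTF-8 bytes

# Hypothetical helper: ask iconv to decode the file as UTF-8 and
# report success or failure through its exit status.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 "$tmp/koi8.txt"; then koi8_ok=yes; else koi8_ok=no; fi
if is_utf8 "$tmp/utf8.txt"; then utf8_ok=yes; else utf8_ok=no; fi
echo "koi8.txt valid UTF-8: $koi8_ok"
echo "utf8.txt valid UTF-8: $utf8_ok"
```

The KOI8-R bytes are rejected because `f0` announces a four-byte UTF-8 sequence and `d2` is not a continuation byte; a terminal shows U+FFFD at that point instead of stopping.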
Re^6: create clone script for utf8 encoding by ikegami (Patriarch) on Dec 20, 2018 at 17:34 UTC
Re^3: create clone script for utf8 encoding by Anonymous Monk on Dec 17, 2018 at 09:49 UTC
3.file has one value more than the 12 I expected. (?)

The `0A` at the end is the newline, `"\n"`. If you omit it, the shell prompt will be printed on the same line as the text:

username@localhost:~$ printf '\xf0\xd2\xc9\xd7\xc5\xd4' | iconv -f koi8-r
Приветusername@localhost:~$

Together with carriage return `"\r"`, this can be used to produce various effects on the console. For example, the following program prints two different strings, but after it's finished the terminal will look like it didn't print anything:

perl -e '$|=1; print "Now you see me!"; sleep 1; print "\r"; print "Now you don\x27t! "; sleep 1; printf "\r"'

(Actually, you may see part of its output if your shell prompt is short enough. For a more honest but less portable version, see man console_codes.)
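
The text printed by the Perl one-liner does not vanish from the output stream; only the display is overwritten. A small sketch that makes this visible by turning each carriage return into a newline with `tr` (the same two strings are emitted here with `printf`, without the pauses):

```shell
# \047 is the octal escape for the apostrophe.
visible=$(printf 'Now you see me!\rNow you don\047t! \r' | tr '\r' '\n')
printf '%s\n' "$visible"

# Count the lines that were hidden behind the carriage returns.
line_count=$(( $(printf '%s\n' "$visible" | wc -l) ))
echo "lines: $line_count"
```

Redirecting the original one-liner to a file and inspecting it with a hex dump would show the same thing: both strings are there, separated by `0d` bytes.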