in reply to create clone script for utf8 encoding

You've already gotten some good answers, I just wanted to pick up on a couple more points. In general, you might want to have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), as well as perluniintro and perlunicode.

when I touch a file into existence, it is us-ascii

touch creates an empty file, and it doesn't have any encoding - a file's encoding is not some kind of metadata attribute secretly attached to it. A file just contains bytes, and it is up to the programs reading and writing it to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here). Many software packages have defaults for such encodings, such as Latin-1, CP1252, UTF-8, or UTF-16, but the packages often don't agree - and of those four examples, only the latter two can encode all valid Unicode code points. As for ASCII, it is a subset of many different character encodings (such as Latin-1, CP1252, and UTF-8), and it only covers the byte values 0 to 127 (the lower 7 bits of a byte), so it encodes even fewer characters:

+-Dec-Hex-Oct----+-Dec-Hex-Oct----+-Dec-Hex-Oct----+-Dec-Hex-Oct----+
|   0  00 000 ␀  |  32  20 040 ␠  |  64  40 100 @  |  96  60 140 `  |
|   1  01 001 ␁  |  33  21 041 !  |  65  41 101 A  |  97  61 141 a  |
|   2  02 002 ␂  |  34  22 042 "  |  66  42 102 B  |  98  62 142 b  |
|   3  03 003 ␃  |  35  23 043 #  |  67  43 103 C  |  99  63 143 c  |
|   4  04 004 ␄  |  36  24 044 $  |  68  44 104 D  | 100  64 144 d  |
|   5  05 005 ␅  |  37  25 045 %  |  69  45 105 E  | 101  65 145 e  |
|   6  06 006 ␆  |  38  26 046 &  |  70  46 106 F  | 102  66 146 f  |
|   7  07 007 ␇  |  39  27 047 '  |  71  47 107 G  | 103  67 147 g  |
|   8  08 010 ␈  |  40  28 050 (  |  72  48 110 H  | 104  68 150 h  |
|   9  09 011 ␉  |  41  29 051 )  |  73  49 111 I  | 105  69 151 i  |
|  10  0A 012 ␊  |  42  2A 052 *  |  74  4A 112 J  | 106  6A 152 j  |
|  11  0B 013 ␋  |  43  2B 053 +  |  75  4B 113 K  | 107  6B 153 k  |
|  12  0C 014 ␌  |  44  2C 054 ,  |  76  4C 114 L  | 108  6C 154 l  |
|  13  0D 015 ␍  |  45  2D 055 -  |  77  4D 115 M  | 109  6D 155 m  |
|  14  0E 016 ␎  |  46  2E 056 .  |  78  4E 116 N  | 110  6E 156 n  |
|  15  0F 017 ␏  |  47  2F 057 /  |  79  4F 117 O  | 111  6F 157 o  |
|  16  10 020 ␐  |  48  30 060 0  |  80  50 120 P  | 112  70 160 p  |
|  17  11 021 ␑  |  49  31 061 1  |  81  51 121 Q  | 113  71 161 q  |
|  18  12 022 ␒  |  50  32 062 2  |  82  52 122 R  | 114  72 162 r  |
|  19  13 023 ␓  |  51  33 063 3  |  83  53 123 S  | 115  73 163 s  |
|  20  14 024 ␔  |  52  34 064 4  |  84  54 124 T  | 116  74 164 t  |
|  21  15 025 ␕  |  53  35 065 5  |  85  55 125 U  | 117  75 165 u  |
|  22  16 026 ␖  |  54  36 066 6  |  86  56 126 V  | 118  76 166 v  |
|  23  17 027 ␗  |  55  37 067 7  |  87  57 127 W  | 119  77 167 w  |
|  24  18 030 ␘  |  56  38 070 8  |  88  58 130 X  | 120  78 170 x  |
|  25  19 031 ␙  |  57  39 071 9  |  89  59 131 Y  | 121  79 171 y  |
|  26  1A 032 ␚  |  58  3A 072 :  |  90  5A 132 Z  | 122  7A 172 z  |
|  27  1B 033 ␛  |  59  3B 073 ;  |  91  5B 133 [  | 123  7B 173 {  |
|  28  1C 034 ␜  |  60  3C 074 <  |  92  5C 134 \  | 124  7C 174 |  |
|  29  1D 035 ␝  |  61  3D 075 =  |  93  5D 135 ]  | 125  7D 175 }  |
|  30  1E 036 ␞  |  62  3E 076 >  |  94  5E 136 ^  | 126  7E 176 ~  |
|  31  1F 037 ␟  |  63  3F 077 ?  |  95  5F 137 _  | 127  7F 177 ␡  |
+----------------+----------------+----------------+----------------+

Code I used to generate that table:

use warnings;
use strict;
use open qw/:std :utf8/;
print "+", "-Dec-Hex-Oct----+"x4, "\n";
for my $y (0..0x1F) {
    print "|";
    for my $c (map {$y|$_} 0x00,0x20,0x40,0x60) {
        printf " %3d  %02X %03o %s  |", $c, $c, $c,
            chr( $c<0x21 ? 0x2400+$c : $c==0x7F ? 0x2421 : $c );
    }
    print "\n";
}
print "+", (("-"x16)."+")x4, "\n";

(For display, I'm using the Unicode Control Pictures in the U+2400 range to represent the nonprintable characters 0x00-0x20 and 0x7F.)
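
Coming back to the point that a file just contains bytes: here's a minimal sketch (using the core Encode module; the byte values are just for illustration) of how the very same bytes come out as different "characters" depending on which encoding you read them with:

use strict;
use warnings;
use Encode qw/decode/;

my $bytes = "\xE2\x82\xAC";                 # the same three raw bytes...
my $as_utf8   = decode("UTF-8",   $bytes);  # ...read as UTF-8: 1 character (the Euro sign)
my $as_latin1 = decode("Latin-1", $bytes);  # ...read as Latin-1: 3 characters
printf "UTF-8: %d character(s), Latin-1: %d character(s)\n",
    length($as_utf8), length($as_latin1);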

For example, the Euro symbol € ("\x{20AC}" or "\N{U+20AC}" in Perl) is encoded as the bytes E2 82 AC in UTF-8, as 20 AC in UTF-16BE, as A4 in Latin-9 (ISO-8859-15), and as 80 in CP1252 - and it can't be represented in ASCII or Latin-1 at all.

(Copied from my post here.) I wrote some more about the whole topic here.
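
And if you want to double-check such byte sequences from Perl itself, here's a quick sketch (the list of encoding names is just a sample):

use strict;
use warnings;
use Encode qw/encode/;

# encode U+20AC (the Euro sign) and show the resulting bytes in hex
for my $enc ("UTF-8", "UTF-16BE", "ISO-8859-15", "cp1252") {
    printf "%-10s => %s\n", $enc,
        join " ", map { sprintf "%02X", ord } split //, encode($enc, "\x{20AC}");
}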

files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform.

As the AM post explained, that's unlikely here, since "Ü" is not representable in ASCII (which is also what iconv is telling you with its error). Most likely the file is UTF-8, but you can check by piping the output to e.g. hexdump: for example, on my terminal, echo -n "€" | hexdump -C shows the bytes e2 82 ac, which as I showed above is the UTF-8 encoding of the Euro sign. If you're really unsure of a file's encoding, there's Encode::Guess (I showed an example here), keeping in mind that it's just guessing.
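
As a rough sketch of what that might look like (the file name here is hypothetical, and "latin1" is just one extra suspect encoding you could feed to the guesser):

use strict;
use warnings;
use Encode::Guess;

my $file = "mystery.txt";   # hypothetical input file
open my $fh, "<:raw", $file or die "$file: $!";
my $bytes = do { local $/; <$fh> };   # slurp the raw bytes
close $fh;

# guess_encoding() returns an encoding object on success,
# or an error message string if it can't decide
my $enc = guess_encoding($bytes, "latin1");
ref $enc or die "Couldn't guess the encoding: $enc";
print "Looks like ", $enc->name, "\n";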

I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding.

If you're talking about the Perl source code itself, IMHO the only two really useful choices are plain ASCII or UTF-8, and in the latter case, you have to tell Perl by adding use utf8; at the top of your file (see utf8). If your Perl source code is in ASCII, you can still represent Unicode characters in strings and regexes using escapes like "\x{...}" and "\N{...}" (see also charnames). And since ASCII is a subset of UTF-8, if you stick to those two encodings for your Perl source, your "clone" script doesn't have anything to worry about: it can just cp the files, and all you need to do is add the use utf8; when appropriate. Just make sure that whatever editor you're using to work on your Perl scripts uses UTF-8 when it saves the files.
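
For instance, here's a minimal sketch of the equivalent ways to get the same string: a UTF-8 encoded source file with use utf8;, versus escapes that work in pure ASCII source (the string "Übung" is just an example):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file is saved as UTF-8
use charnames qw/:full/;               # for \N{...} by character name
use open qw/:std :encoding(UTF-8)/;    # make the std streams speak UTF-8

my $literal = "Übung";                         # non-ASCII literal (needs use utf8;)
my $escaped = "\x{DC}bung";                    # same string, pure ASCII source
my $named   = "\N{LATIN CAPITAL LETTER U WITH DIAERESIS}bung";  # ditto
print "all the same\n" if $literal eq $escaped and $escaped eq $named;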

If you're talking about files that your Perl program is reading and writing, you'd specify those encodings with the three-argument open (which I'd recommend), with binmode, or set defaults with the open pragma (the latter is useful for changing the encoding of the STDIN/OUT/ERR streams as well). For en-/decoding strings of bytes you've already got in Perl, there's the Encode family of modules, plus for UTF-8, utf8::encode() and utf8::decode(). There's also the -C command-line switch (which I'd mostly only use for oneliners) and the PERLIO environment variable (which I've almost never had a need for), see perlrun.
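
To illustrate a couple of those in one place, a small sketch (the byte values are mine; they happen to be the UTF-8 encoding of a Cyrillic "Ж"):

use strict;
use warnings;
use Encode qw/encode decode/;

my $bytes = "\xD0\x96";                 # two raw bytes (UTF-8 for "Ж")
my $chars = decode("UTF-8", $bytes);    # bytes -> characters
my $again = encode("UTF-8", $chars);    # characters -> back to bytes

binmode STDOUT, ":encoding(UTF-8)";     # encode everything printed to STDOUT
print $chars, "\n";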

BTW, you can do the same thing as iconv with Perl:

use warnings;
use strict;
# iconv -f UTF-8 utf8.txt -t Latin9 -o latin9.txt
my ($ifile, $ienc) = ("utf8.txt",   "UTF-8");
my ($ofile, $oenc) = ("latin9.txt", "Latin9");
open my $ifh, "<:raw:encoding($ienc)", $ifile or die "$ifile: $!";
open my $ofh, ">:raw:encoding($oenc)", $ofile or die "$ofile: $!";
print $ofh do { local $/; <$ifh> };
close $ifh;
close $ofh;
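
(The :raw layer at the start of each mode makes sure no other default layers get in the way before the :encoding(...) layer does the actual conversion; see PerlIO for details.)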

Re^2: create clone script for utf8 encoding
by Aldebaran (Curate) on Dec 19, 2018 at 04:29 UTC
    a file's encoding is not some kind of metadata attribute secretly attached to a file. A file just contains bytes, and it is up to the reading and writing programs to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here) on reading and writing.

    Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

    $ file -i *.pl
    18.clone.pl: text/x-perl; charset=us-ascii
    1.a.pl:      text/x-perl; charset=utf-8
    1.haukex.pl: text/x-perl; charset=us-ascii
    1.k.pl:      text/x-perl; charset=us-ascii
    2.haukex.pl: text/x-perl; charset=us-ascii
    3.haukex.pl: text/x-perl; charset=utf-8
    3.ping3a.pl: text/x-perl; charset=us-ascii
    4.haukex.pl: text/x-perl; charset=utf-8
    4.ping3a.pl: text/x-perl; charset=us-ascii
    5.ping3a.pl: text/x-perl; charset=us-ascii
    $

    What seems to be very much the case is that the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it. I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters, like so with pre tags:

    $ ./1.a.pl 3.haukex.pl
    argv is 3.haukex.pl
    before decode is 3.haukex.pl
    after decode is 3.haukex.pl
    current is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw
    -------------
    in_file: 3.haukex.pl
    new base is 4.haukex.pl
    save path is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
    return is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
    2.haukex.pl
    3.haukex.pl
    4.haukex.pl

    #!/usr/bin/perl -w
    use 5.011;
    use Carp;
    use Data::Alias 'alias';
    use Data::Dumper;
    use utf8;   # a la François
    use open OUT => ':encoding(utf8)';
    use open ':std';

    sub rangeparse {
    	local $_ = shift;
    	my @o;  #  row1,col1, row2,col2  (-1 = last row/col)
    	if (@o=/\AR([0-9]+|n)C([0-9]+|n):R([0-9]+|n)C([0-9]+|n)\z/) {}
    	elsif (/\AR([0-9]+|n):R([0-9]+|n)\z/) { @o=($1,1,$2,-1) }
    	elsif (/\AC([0-9]+|n):C([0-9]+|n)\z/) { @o=(1,$1,-1,$2) }
    	elsif (/\AR([0-9]+|n)C([0-9]+|n)\z/) { @o=($1,$2,$1,$2) }
    	elsif (/\AR([0-9]+|n)\z/) { @o=($1,1,$1,-1) }
    	elsif (/\AC([0-9]+|n)\z/) { @o=(1,$1,-1,$1) }
    	else { croak "failed to parse '$_'" }
    	$_ eq 'n' and $_=-1 for @o;
    	return \@o;
    }

    use Test::More tests=>2;

    is_deeply rangeparse("RnC2:RnC5"), [-1, 2, -1, 5];
    is_deeply rangeparse("R3C2:RnCn"), [ 3, 2, -1,-1];

    my $data = ['й', ' ', ' ', 'л', ' ', ' ', 'с', ' ', ' ', 1..9];

    say Dumper $data;

    sub getsubset {
    	my ($data,$range) = @_;
    	my $cols = @{$$data[0]};
    	@$_==$cols or croak "data not rectangular" for @$data;
    	$range = rangeparse($range) unless ref $range eq 'ARRAY';
    	@$range==4 or croak "bad size of range";
    	my @max = (0+@$data,$cols)x2;
    	for my $i (0..3) {
    		$$range[$i]=$max[$i] if $$range[$i]<0;
    		croak "index $i out of range"
    			if $$range[$i]<1 || $$range[$i]>$max[$i];
    	}
    	croak "bad rows $$range[0]-$$range[2]" if $$range[0]>$$range[2];
    	croak "bad cols $$range[1]-$$range[3]" if $$range[1]>$$range[3];
    	my @cis = $$range[1]-1 .. $$range[3]-1;
    	return [ map { sub{\@_}->(@{$$data[$_]}[@cis]) }
    		$$range[0]-1 .. $$range[2]-1 ]
    }

    This is a trimmed-down version of haukex's result in Selecting Ranges of 2-Dimensional Data. I'm populating it with Cyrillic values and hope to run some tests, but I still want to get this clone tool squared away. Still working through other parts of your post....

      Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

      Yes, with the emphasis being that it's just a guess.

      the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it.

      Yes, and it might be important to note that there are certain sequences of bytes that are not valid UTF-8 (see e.g. UTF-8), which means that in some cases, it's possible to differentiate between random bytes and valid UTF-8 text. Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.
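
      As a quick sketch of checking that from Perl (on reasonably recent Perls, utf8::decode() returns false and leaves the string alone if the bytes aren't valid UTF-8; the byte values here are just examples):

      use strict;
      use warnings;

      for my $orig ("\xE2\x82\xAC", "\xC3\x28") {   # valid vs. invalid UTF-8
          my $bytes = $orig;   # copy, since utf8::decode() modifies in place
          printf "%s => %s\n",
              join(" ", map { sprintf "%02X", ord } split //, $orig),
              utf8::decode($bytes) ? "valid UTF-8" : "not valid UTF-8";
      }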

      I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters

      Note that if you have a file that is originally ASCII and you add non-ASCII characters to it, it's up to the editor to choose which encoding it will use when saving the file. Many editors will default to UTF-8, but some may not!

      with pre tags

      You may have noticed that when using <pre> tags, you have to escape square brackets, [ is &#91; and ] is &#93;.

      Update: Improved wording of first paragraph.

        it might be important to note that there are certain sequences of bytes that are not valid UTF-8 (see e.g. UTF-8), which means that in some cases, it's possible to differentiate between random bytes and valid UTF-8 text.

        I see.

        Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.

        Thank you for the delousing reference at file. I pulled out what I thought was relevant. I've "known" this before, but if you get behind on reading, things change.

        You may have noticed that when using pre tags, you have to escape square brackets

        I do now. Life is like a box of chocolates with pre tags for this particular forrest gump. The engine that parses the xml is gonna look at [ ] and create a hyperlink, isn't it? I think I'm gonna go back to code tags, even when content has cyrillic. Others can make a clean download without having to copy and paste off the screen.