in reply to create clone script for utf8 encoding
You've already gotten some good answers; I just wanted to pick up on a couple more points. In general, you might want to have a look at "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", as well as perluniintro and perlunicode.
when I touch a file into existence, it is us-ascii
touch creates an empty file, and an empty file doesn't have any encoding. A file's encoding is not some kind of metadata attribute secretly attached to the file. A file just contains bytes, and it is up to the programs reading and writing it to translate between that sequence of bytes and the more abstract concept of "characters" (I am using that term loosely here). Many software packages have default encodings, such as Latin-1, CP1252, UTF-8, or UTF-16, but they often don't agree with each other. Of those four examples, only the latter two can encode all valid Unicode code points. As for ASCII, it is a subset of many different character encodings (such as Latin-1, CP1252, and UTF-8), and it only covers bytes with values 0 to 127 (the lower 7 bits of the byte), so it encodes even fewer characters:
+-Dec-Hex-Oct----+-Dec-Hex-Oct----+-Dec-Hex-Oct----+-Dec-Hex-Oct----+
|   0  00 000  ␀ |  32  20 040  ␠ |  64  40 100  @ |  96  60 140  ` |
|   1  01 001  ␁ |  33  21 041  ! |  65  41 101  A |  97  61 141  a |
|   2  02 002  ␂ |  34  22 042  " |  66  42 102  B |  98  62 142  b |
|   3  03 003  ␃ |  35  23 043  # |  67  43 103  C |  99  63 143  c |
|   4  04 004  ␄ |  36  24 044  $ |  68  44 104  D | 100  64 144  d |
|   5  05 005  ␅ |  37  25 045  % |  69  45 105  E | 101  65 145  e |
|   6  06 006  ␆ |  38  26 046  & |  70  46 106  F | 102  66 146  f |
|   7  07 007  ␇ |  39  27 047  ' |  71  47 107  G | 103  67 147  g |
|   8  08 010  ␈ |  40  28 050  ( |  72  48 110  H | 104  68 150  h |
|   9  09 011  ␉ |  41  29 051  ) |  73  49 111  I | 105  69 151  i |
|  10  0A 012  ␊ |  42  2A 052  * |  74  4A 112  J | 106  6A 152  j |
|  11  0B 013  ␋ |  43  2B 053  + |  75  4B 113  K | 107  6B 153  k |
|  12  0C 014  ␌ |  44  2C 054  , |  76  4C 114  L | 108  6C 154  l |
|  13  0D 015  ␍ |  45  2D 055  - |  77  4D 115  M | 109  6D 155  m |
|  14  0E 016  ␎ |  46  2E 056  . |  78  4E 116  N | 110  6E 156  n |
|  15  0F 017  ␏ |  47  2F 057  / |  79  4F 117  O | 111  6F 157  o |
|  16  10 020  ␐ |  48  30 060  0 |  80  50 120  P | 112  70 160  p |
|  17  11 021  ␑ |  49  31 061  1 |  81  51 121  Q | 113  71 161  q |
|  18  12 022  ␒ |  50  32 062  2 |  82  52 122  R | 114  72 162  r |
|  19  13 023  ␓ |  51  33 063  3 |  83  53 123  S | 115  73 163  s |
|  20  14 024  ␔ |  52  34 064  4 |  84  54 124  T | 116  74 164  t |
|  21  15 025  ␕ |  53  35 065  5 |  85  55 125  U | 117  75 165  u |
|  22  16 026  ␖ |  54  36 066  6 |  86  56 126  V | 118  76 166  v |
|  23  17 027  ␗ |  55  37 067  7 |  87  57 127  W | 119  77 167  w |
|  24  18 030  ␘ |  56  38 070  8 |  88  58 130  X | 120  78 170  x |
|  25  19 031  ␙ |  57  39 071  9 |  89  59 131  Y | 121  79 171  y |
|  26  1A 032  ␚ |  58  3A 072  : |  90  5A 132  Z | 122  7A 172  z |
|  27  1B 033  ␛ |  59  3B 073  ; |  91  5B 133  [ | 123  7B 173  { |
|  28  1C 034  ␜ |  60  3C 074  < |  92  5C 134  \ | 124  7C 174  | |
|  29  1D 035  ␝ |  61  3D 075  = |  93  5D 135  ] | 125  7D 175  } |
|  30  1E 036  ␞ |  62  3E 076  > |  94  5E 136  ^ | 126  7E 176  ~ |
|  31  1F 037  ␟ |  63  3F 077  ? |  95  5F 137  _ | 127  7F 177  ␡ |
+----------------+----------------+----------------+----------------+
Code I used to generate that table:
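use warnings;
use strict;
use open qw/:std :utf8/;

# Top border with the column headings
print "+", "-Dec-Hex-Oct----+" x 4, "\n";
for my $y (0..0x1F) {
    print "|";
    # Each row shows $y offset by 0x00, 0x20, 0x40 and 0x60
    for my $c (map { $y | $_ } 0x00, 0x20, 0x40, 0x60) {
        # Control Pictures for 0x00-0x20 and 0x7F, otherwise the character itself;
        # the field widths are chosen so each cell is as wide as the borders
        printf " %3d  %02X %03o  %s |", $c, $c, $c,
            chr( $c < 0x21 ? 0x2400 + $c : $c == 0x7F ? 0x2421 : $c );
    }
    print "\n";
}
# Bottom border
print "+", ( ("-" x 16) . "+" ) x 4, "\n";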
(For display, I'm using the Unicode Control Pictures in the U+2400 range to represent the nonprintable characters 0x00-0x20 and 0x7F.)
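To illustrate the earlier point that a file is only bytes and the reading program chooses the interpretation, here is a minimal sketch that decodes the same two bytes under three encodings. The helper name code_points is just something made up for this example, and the encoding names are the ones the Encode module accepts.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode $bytes as $encoding and return the result as "U+XXXX" labels.
# (code_points is a hypothetical helper, not part of any module.)
sub code_points {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return map { sprintf "U+%04X", ord($_) } split //, $chars;
}

my $bytes = "\xC3\x9C";    # the same two bytes for every encoding below

for my $encoding (qw(UTF-8 ISO-8859-1 CP1252)) {
    printf "%-10s => %s\n", $encoding, join " ", code_points($encoding, $bytes);
}
```

Read as UTF-8, the two bytes are the single character "Ü" (U+00DC); read as Latin-1 or CP1252, they are two unrelated characters.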
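For example, the Euro symbol € ("\x{20AC}" or "\N{U+20AC}" in Perl) is the three bytes E2 82 AC in UTF-8, the two bytes 20 AC in UTF-16BE, the single byte 80 in CP1252, and the single byte A4 in Latin-9, while Latin-1 cannot represent it at all.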
(Copied from my post here.) I wrote some more about the whole topic here.
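If you want to check the byte representation of the Euro symbol yourself, something like the following sketch should do. The helper hex_bytes is a name invented for this example; Encode::FB_CROAK is used so that encode() dies instead of silently substituting a character the target encoding cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of $string in $encoding as a hex dump.
# (hex_bytes is a hypothetical helper; it dies if a character can't be encoded.)
sub hex_bytes {
    my ($encoding, $string) = @_;
    # $string is our own copy, so it doesn't matter that encode() may modify it
    my $bytes = encode($encoding, $string, Encode::FB_CROAK());
    return join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
}

my $euro = "\x{20AC}";
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE CP1252 ISO-8859-15 ISO-8859-1)) {
    my $hex = eval { hex_bytes($encoding, $euro) };
    printf "%-12s %s\n", $encoding, defined $hex ? $hex : "(not representable)";
}
```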
files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform.
As the AM post explained, that's unlikely here, since "Ü" is not representable in ASCII (which is also what iconv is telling you with its error). Most likely it's UTF-8, but you can check by piping the output to e.g. hexdump. For example, on my terminal echo -n "€" | hexdump -C shows the bytes e2 82 ac, and as I showed above, that's the UTF-8 encoding. If you're really unsure of a file's encoding, there's Encode::Guess (I showed an example here), keeping in mind that it's just guessing.
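As a rough sketch of what guessing looks like with Encode::Guess: with no extra suspects, it only considers ASCII, UTF-8, and BOM-marked UTF-16/32, so Latin-1 bytes simply produce no guess. The helper guess_name and the sample strings are made up for this example.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Return the name of the guessed encoding, or undef if guessing failed.
# (guess_name is a hypothetical helper for this example.)
sub guess_name {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    # On failure, guess_encoding() returns an error string, not an object
    return ref($guess) ? $guess->name : undef;
}

my %samples = (
    'plain ASCII'   => "Uber",
    'UTF-8 bytes'   => "\xC3\x9Cber",    # "Über" encoded as UTF-8
    'Latin-1 bytes' => "\xDCber",        # "Über" encoded as Latin-1
);
for my $label (sort keys %samples) {
    my $name = guess_name($samples{$label});
    printf "%-14s %s\n", $label, defined $name ? $name : "(no guess)";
}
```

Adding more suspects (for example 'latin1') tends to make the result ambiguous rather than better, which is one reason to treat the output as a hint only.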
I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding.
If you're talking about the Perl source code itself, IMHO the only two really useful choices are plain ASCII or UTF-8, and in the latter case you have to tell Perl by adding use utf8; at the top of your file (see utf8). If your Perl source code is in ASCII, you can still represent Unicode characters in strings and regexes using escapes like "\x{...}" and "\N{...}" (see also charnames). And since ASCII is a subset of UTF-8, if you stick to those two encodings for your Perl source, your "clone" script doesn't have anything to worry about: it can just cp the files, and all you need to do is add the use utf8; when appropriate. Just make sure that whatever editor you're using to work on your Perl scripts uses UTF-8 when it saves the files.
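To make the escapes concrete, here is a small sketch whose source is pure ASCII and therefore needs no use utf8;. It spells the Euro sign three ways and shows that each one is the same single character. The sub name euro_spellings is invented for this example.

```perl
use strict;
use warnings;
use charnames ':full';    # so that "\N{EURO SIGN}" also works on older perls

# Three ASCII-only spellings of the same one-character string.
# (euro_spellings is a hypothetical helper for this example.)
sub euro_spellings {
    return (
        hex_escape   => "\x{20AC}",
        code_point   => "\N{U+20AC}",
        unicode_name => "\N{EURO SIGN}",
    );
}

my %euro = euro_spellings();
for my $how (sort keys %euro) {
    printf "%-12s length=%d ord=U+%04X\n",
        $how, length($euro{$how}), ord($euro{$how});
}
```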
If you're talking about files that your Perl program is reading and writing, you'd specify those encodings with the three-argument open (which I'd recommend), with binmode, or set defaults with the open pragma (the latter is useful for changing the encoding of the STDIN/OUT/ERR streams as well). For en-/decoding strings of bytes you've already got in Perl, there's the Encode family of modules, plus for UTF-8, utf8::encode() and utf8::decode(). There's also the -C command-line switch (which I'd mostly only use for oneliners) and the PERLIO environment variable (which I've almost never had a need for), see perlrun.
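A minimal sketch of how these pieces fit together, assuming a scratch file from File::Temp: it writes a string through an :encoding layer, reads the raw bytes back with binmode, and compares them with what encode() and decode() produce. The helper names write_encoded and read_bytes are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Write the character string $text to $path in the given encoding.
sub write_encoded {
    my ($path, $encoding, $text) = @_;
    open my $fh, ">:raw:encoding($encoding)", $path or die "$path: $!";
    print $fh $text;
    close $fh or die "$path: $!";
}

# Read the file back as raw, undecoded bytes.
sub read_bytes {
    my ($path) = @_;
    open my $fh, "<", $path or die "$path: $!";
    binmode $fh;    # no decoding layer, same effect as opening with "<:raw"
    local $/;       # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

my (undef, $path) = tempfile(UNLINK => 1);    # scratch file, removed at exit
my $text = "\x{20AC}uro";                      # a string of 4 characters

write_encoded($path, "UTF-8", $text);
my $bytes = read_bytes($path);

printf "characters: %d, bytes on disk: %d\n", length($text), length($bytes);
print "layer and encode() agree\n" if $bytes eq encode("UTF-8", $text);
print "decode() round-trips\n"     if decode("UTF-8", $bytes) eq $text;
```

The same helpers work with any encoding name that Encode knows, as long as the text is representable in it.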
BTW, you can do the same thing as iconv with Perl:
use warnings;
use strict;

# iconv -f UTF-8 utf8.txt -t Latin9 -o latin9.txt
my ($ifile, $ienc) = ("utf8.txt", "UTF-8");
my ($ofile, $oenc) = ("latin9.txt", "Latin9");
open my $ifh, "<:raw:encoding($ienc)", $ifile or die "$ifile: $!";
open my $ofh, ">:raw:encoding($oenc)", $ofile or die "$ofile: $!";
print $ofh do { local $/; <$ifh> };
close $ifh;
close $ofh;