You've already gotten some good answers; I just wanted to pick up on a couple more points. In general, you might want to have a look at "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", as well as perluniintro and perlunicode.

when I touch a file into existence, it is us-ascii

touch creates an empty file, and it doesn't have any encoding: a file's encoding is not some kind of metadata attribute secretly attached to the file. A file just contains bytes, and it is up to the programs reading and writing it to translate that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here). Many software packages have defaults for such encodings, such as Latin-1, CP1252, UTF-8, or UTF-16, but the software packages often don't agree. Of those four examples, only the last two can encode all valid Unicode code points. As for ASCII, it is a subset of many different character encodings (such as Latin-1, CP1252, and UTF-8), and it only covers bytes with values 0 to 127 (the lower 7 bits of the byte), so it encodes even fewer characters:

+-Dec-Hex-Oct----+-Dec-Hex-Oct----+-Dec-Hex-Oct----+-Dec-Hex-Oct----+
|   0  00 000 ␀  |  32  20 040 ␠  |  64  40 100 @  |  96  60 140 `  |
|   1  01 001 ␁  |  33  21 041 !  |  65  41 101 A  |  97  61 141 a  |
|   2  02 002 ␂  |  34  22 042 "  |  66  42 102 B  |  98  62 142 b  |
|   3  03 003 ␃  |  35  23 043 #  |  67  43 103 C  |  99  63 143 c  |
|   4  04 004 ␄  |  36  24 044 $  |  68  44 104 D  | 100  64 144 d  |
|   5  05 005 ␅  |  37  25 045 %  |  69  45 105 E  | 101  65 145 e  |
|   6  06 006 ␆  |  38  26 046 &  |  70  46 106 F  | 102  66 146 f  |
|   7  07 007 ␇  |  39  27 047 '  |  71  47 107 G  | 103  67 147 g  |
|   8  08 010 ␈  |  40  28 050 (  |  72  48 110 H  | 104  68 150 h  |
|   9  09 011 ␉  |  41  29 051 )  |  73  49 111 I  | 105  69 151 i  |
|  10  0A 012 ␊  |  42  2A 052 *  |  74  4A 112 J  | 106  6A 152 j  |
|  11  0B 013 ␋  |  43  2B 053 +  |  75  4B 113 K  | 107  6B 153 k  |
|  12  0C 014 ␌  |  44  2C 054 ,  |  76  4C 114 L  | 108  6C 154 l  |
|  13  0D 015 ␍  |  45  2D 055 -  |  77  4D 115 M  | 109  6D 155 m  |
|  14  0E 016 ␎  |  46  2E 056 .  |  78  4E 116 N  | 110  6E 156 n  |
|  15  0F 017 ␏  |  47  2F 057 /  |  79  4F 117 O  | 111  6F 157 o  |
|  16  10 020 ␐  |  48  30 060 0  |  80  50 120 P  | 112  70 160 p  |
|  17  11 021 ␑  |  49  31 061 1  |  81  51 121 Q  | 113  71 161 q  |
|  18  12 022 ␒  |  50  32 062 2  |  82  52 122 R  | 114  72 162 r  |
|  19  13 023 ␓  |  51  33 063 3  |  83  53 123 S  | 115  73 163 s  |
|  20  14 024 ␔  |  52  34 064 4  |  84  54 124 T  | 116  74 164 t  |
|  21  15 025 ␕  |  53  35 065 5  |  85  55 125 U  | 117  75 165 u  |
|  22  16 026 ␖  |  54  36 066 6  |  86  56 126 V  | 118  76 166 v  |
|  23  17 027 ␗  |  55  37 067 7  |  87  57 127 W  | 119  77 167 w  |
|  24  18 030 ␘  |  56  38 070 8  |  88  58 130 X  | 120  78 170 x  |
|  25  19 031 ␙  |  57  39 071 9  |  89  59 131 Y  | 121  79 171 y  |
|  26  1A 032 ␚  |  58  3A 072 :  |  90  5A 132 Z  | 122  7A 172 z  |
|  27  1B 033 ␛  |  59  3B 073 ;  |  91  5B 133 [  | 123  7B 173 {  |
|  28  1C 034 ␜  |  60  3C 074 <  |  92  5C 134 \  | 124  7C 174 |  |
|  29  1D 035 ␝  |  61  3D 075 =  |  93  5D 135 ]  | 125  7D 175 }  |
|  30  1E 036 ␞  |  62  3E 076 >  |  94  5E 136 ^  | 126  7E 176 ~  |
|  31  1F 037 ␟  |  63  3F 077 ?  |  95  5F 137 _  | 127  7F 177 ␡  |
+----------------+----------------+----------------+----------------+

Code I used to generate that table:

use warnings;
use strict;
use open qw/:std :utf8/;

print "+", "-Dec-Hex-Oct----+"x4, "\n";
for my $y (0..0x1F) {
    print "|";
    for my $c (map {$y|$_} 0x00,0x20,0x40,0x60) {
        printf " %3d  %02X %03o %s  |", $c, $c, $c,
            chr( $c<0x21 ? 0x2400+$c : $c==0x7F ? 0x2421 : $c );
    }
    print "\n";
}
print "+", (("-"x16)."+")x4, "\n";

(For display, I'm using the Unicode Control Pictures in the U+2400 range to represent the nonprintable characters 0x00-0x20 and 0x7F.)

For example, the Euro symbol € ("\x{20AC}" or "\N{U+20AC}" in Perl) is:

(Copied from my post here.) I wrote some more about the whole topic here.
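If you want to see for yourself how one character turns into different bytes depending on the encoding, here's a minimal sketch using the core Encode module. The helper name hex_bytes and the selection of encodings are just my choices for illustration.

```perl
use warnings;
use strict;
use Encode qw/encode/;

# Return the bytes of $string in $encoding as space-separated hex,
# or undef if the string can't be represented in that encoding.
# (hex_bytes is a made-up helper name for this sketch.)
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = eval {
        encode($encoding, $string, Encode::FB_CROAK | Encode::LEAVE_SRC)
    };
    return undef unless defined $bytes;
    return join " ", map { sprintf "%02X", ord } split //, $bytes;
}

my $euro = "\x{20AC}";
for my $enc (qw/ ascii iso-8859-1 iso-8859-15 cp1252 UTF-8 UTF-16BE UTF-16LE /) {
    my $hex = hex_bytes($enc, $euro);
    printf "%-12s %s\n", $enc, defined $hex ? $hex : "(not representable)";
}
```

ASCII and Latin-1 (iso-8859-1) can't represent the Euro sign at all, while Latin-9 (iso-8859-15) and CP1252 each use a single but different byte, and the UTF encodings use two or three bytes.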

files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform.

As the AM post explained, that's unlikely here, since "Ü" is not representable in ASCII (which is also what iconv is telling you with its error). Most likely it's UTF-8, but you can check by piping the output to e.g. hexdump. For example, on my terminal echo -n "€" | hexdump -C shows the bytes e2 82 ac, and as I showed above, that's the UTF-8 encoding. If you're really unsure of a file's encoding, there's Encode::Guess (I showed an example here), keeping in mind that it's just guessing.
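The same kind of check can be done from Perl. This is only a rough sketch: classify_bytes and hex_dump are names I made up, and it only tells apart "pure ASCII", "valid UTF-8", and "something else". It can't distinguish e.g. Latin-1 from CP1252.

```perl
use warnings;
use strict;
use Encode qw/decode/;

# Classify a string of raw bytes into one of three rough categories.
sub classify_bytes {
    my ($bytes) = @_;
    # No byte above 0x7F: plain ASCII (this includes the empty string).
    return "ascii" unless $bytes =~ /[^\x00-\x7F]/;
    # Otherwise try a strict UTF-8 decode; FB_CROAK dies on malformed input.
    my $ok = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? "utf-8" : "not utf-8";
}

# Bytes as lowercase hex, roughly like the middle column of "hexdump -C".
sub hex_dump {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02x", ord } split //, $bytes;
}

# Optional usage: pass a filename as the first argument.
my $file = shift @ARGV;
if (defined $file) {
    open my $fh, "<:raw", $file or die "$file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    print "$file: ", classify_bytes($bytes), "\n";
    print hex_dump(substr($bytes, 0, 16)), "\n";
}
```

Note that an empty file, like the one touch creates, comes out as "ascii" here only because it contains no bytes that contradict that; it would be just as valid as UTF-8 or Latin-1.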

I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding.

If you're talking about the Perl source code itself, IMHO really the only two useful choices are plain ASCII or UTF-8, and in the latter case, you have to tell Perl by adding use utf8; at the top of your file (see utf8). If your Perl source code is in ASCII, you can still represent Unicode characters in strings and regexes using escapes like "\x{...}" and "\N{...}" (see also charnames). And since ASCII is a subset of UTF-8, if you stick to those two encodings for your Perl source, your "clone" script doesn't have anything to worry about: it can just cp the files, and all you need to do is add the use utf8; when appropriate. Just make sure that whatever editor you're using to work on your Perl scripts uses UTF-8 when it saves the files.
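As a small illustration of the escapes, here's a sketch whose source is pure ASCII, so it needs no use utf8;. The sub name euro_literals is just something I picked for the example.

```perl
use warnings;
use strict;
use charnames ":full";   # for \N{EURO SIGN}; loaded automatically since Perl 5.16

# Three ways to write the same one-character string (U+20AC)
# without any non-ASCII bytes in the source file.
sub euro_literals {
    return ("\x{20AC}", "\N{U+20AC}", "\N{EURO SIGN}");
}

for my $s (euro_literals()) {
    printf "length=%d ord=U+%04X\n", length($s), ord($s);
}
```

Each line of output should read length=1 ord=U+20AC. If you instead typed a literal € between the quotes, you'd only get that same one-character string if the file is saved as UTF-8 and has use utf8; in effect; without the pragma Perl would see the three individual bytes.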

If you're talking about files that your Perl program is reading and writing, you'd specify those encodings with the three-argument open (which I'd recommend), with binmode, or set defaults with the open pragma (the latter is useful for changing the encoding of the STDIN/OUT/ERR streams as well). For en-/decoding strings of bytes you've already got in Perl, there's the Encode family of modules, plus for UTF-8, utf8::encode() and utf8::decode(). There's also the -C command-line switch (which I'd mostly only use for oneliners) and the PERLIO environment variable (which I've almost never had a need for), see perlrun.
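To make that a bit more concrete, here's a minimal sketch that writes a string through an :encoding layer, reads the raw bytes back, and compares them with what Encode and utf8::encode() produce. The sub name write_and_read_raw and the use of a temporary file are my own choices for the demo.

```perl
use warnings;
use strict;
use Encode qw/encode decode/;
use File::Temp qw/tempfile/;

# Write $text to $file using the given encoding layer,
# then return the raw bytes that ended up on disk.
sub write_and_read_raw {
    my ($file, $encoding, $text) = @_;
    open my $ofh, ">:raw:encoding($encoding)", $file or die "$file: $!";
    print $ofh $text;
    close $ofh or die "$file: $!";
    open my $ifh, "<:raw", $file or die "$file: $!";
    my $bytes = do { local $/; <$ifh> };
    close $ifh;
    return $bytes;
}

my ($tfh, $tmpfile) = tempfile(UNLINK => 1);
close $tfh;

my $text  = "\x{20AC}";   # one character
my $bytes = write_and_read_raw($tmpfile, "UTF-8", $text);
print "bytes on disk: ", join(" ", map { sprintf "%02X", ord } split //, $bytes), "\n";

# The same conversion on strings already in memory:
print "Encode::encode agrees\n" if encode("UTF-8", $text) eq $bytes;
print "Encode::decode agrees\n" if decode("UTF-8", $bytes) eq $text;

# utf8::encode() works in place, so operate on a copy.
my $copy = $text;
utf8::encode($copy);
print "utf8::encode agrees\n" if $copy eq $bytes;
```

For streams that are already open, such as STDOUT, binmode(STDOUT, ":encoding(UTF-8)") applies the same kind of layer after the fact.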

BTW, you can do the same thing as iconv with Perl:

use warnings;
use strict;

# iconv -f UTF-8 utf8.txt -t Latin9 -o latin9.txt
my ($ifile, $ienc) = ("utf8.txt", "UTF-8");
my ($ofile, $oenc) = ("latin9.txt", "Latin9");
open my $ifh, "<:raw:encoding($ienc)", $ifile or die "$ifile: $!";
open my $ofh, ">:raw:encoding($oenc)", $ofile or die "$ofile: $!";
print $ofh do { local $/; <$ifh> };
close $ifh;
close $ofh;


In reply to Re: create clone script for utf8 encoding by haukex
in thread create clone script for utf8 encoding by Aldebaran
