
I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

Strictly speaking, that depends both on the program that created the file and on how you interpret it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file creates a file of seven bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), decode to a greeting in Russian. Nothing in the file itself records that choice.
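To see that the bytes carry no encoding of their own, here is a minimal sketch that decodes the same file under two different assumptions. The file name greeting.bin is made up for illustration, the escapes are written in octal because POSIX printf does not promise \xHH, and it assumes an iconv that knows both KOI8-R and ISO-8859-1 (glibc's does) plus a UTF-8 terminal for display.

```shell
# Six arbitrary bytes plus a newline (same values as \xf0\xd2\xc9\xd7\xc5\xd4).
printf '\360\322\311\327\305\324\n' > greeting.bin

# Decode the very same bytes under two different assumptions.
as_koi8r=$(iconv -f KOI8-R -t UTF-8 greeting.bin)
as_latin1=$(iconv -f ISO-8859-1 -t UTF-8 greeting.bin)

printf 'KOI8-R:     %s\n' "$as_koi8r"    # Привет
printf 'ISO-8859-1: %s\n' "$as_latin1"   # ðÒÉ×ÅÔ
```

Both results are "correct" conversions; only one of them is what the author of the bytes meant.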

It just so happens that when you type text on your English keyboard and it's encoded into bytes according to the rules defined by your locale, the resulting UTF-8 bytes (Ubuntu has been UTF-8 by default for years) mean the same thing if you decode them as ASCII. UTF-8 was designed to be backwards compatible with ASCII: the first 128 code points are encoded as the same single bytes.
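A quick way to convince yourself, assuming od and iconv are available. The helper name hex is something I made up for this sketch; it is not a standard command.

```shell
# Print stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

ascii_hex=$(printf 'hello' | hex)
utf8_hex=$(printf 'hello' | iconv -f US-ASCII -t UTF-8 | hex)

echo "as typed:    $ascii_hex"
echo "after iconv: $utf8_hex"   # identical, because every byte is below 0x80
```

Since the conversion changes nothing, a tool looking at such a file has no reason to call it anything but ASCII.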

$ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
iconv: illegal input sequence at position 0

Does ascii have a representation for Ü?

No. If you consult the ASCII table, you will see that it only defines characters for byte values 0..127. With 26*2 letters + 10 digits + 33 control characters (codes 0..31 plus DEL) to be interpreted by teletypes (or terminal emulators), there is only room left for the space and 32 punctuation marks, but no accented characters. Single-byte encodings like ISO-8859-1 or KOI8-R use the byte values 128..255 for that.
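As an illustration, here is where Ü (U+00DC) ends up in two encodings that do have it. This is only a sketch: hex is a made-up helper, and the character is written as its UTF-8 bytes in octal so the snippet itself stays pure ASCII.

```shell
# Print stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# \303\234 are the two UTF-8 bytes of Ü.
u_utf8=$(printf '\303\234' | hex)
u_latin1=$(printf '\303\234' | iconv -f UTF-8 -t ISO-8859-1 | hex)

echo "UTF-8:      $u_utf8"     # c39c
echo "ISO-8859-1: $u_latin1"   # dc
```

Either way the bytes are above 0x7f, which is why a converter told to read the input as us-ascii gives up at the first one.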

If you run file 2.ascii.de.txt, you will see that it's actually UTF-8. file can also recognize pure ASCII files, because they don't contain any bytes above 127, but it cannot tell different single-byte non-ASCII encodings apart. Those can contain any byte values, so you need statistics about the languages typically written in those encodings to guess (never with 100% certainty) which language and which encoding it is. UTF-8 also uses bytes above 127, but they always follow specific rules (a lead byte followed by a fixed number of continuation bytes) which can be easily checked.
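The same three-way distinction can be roughly sketched with iconv alone, by trying the strictest decoding first. This is not how file works internally; guess_encoding and the sample file names are hypothetical, and it assumes an iconv (such as glibc's) that exits non-zero on invalid input.

```shell
# Classify a file by trying stricter decodings first.
guess_encoding() {
    if iconv -f US-ASCII -t UTF-8 "$1" >/dev/null 2>&1; then
        echo us-ascii        # no byte above 127
    elif iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo utf-8           # high bytes present and well-formed
    else
        echo unknown-8bit    # some other encoding; which one is a guess
    fi
}

printf 'plain text\n'               > sample_ascii.txt
printf '\303\234bung\n'             > sample_utf8.txt    # "Übung" in UTF-8
printf '\360\322\311\327\305\324\n' > sample_koi8r.txt   # "Привет" in KOI8-R

for f in sample_ascii.txt sample_utf8.txt sample_koi8r.txt; do
    printf '%s: %s\n' "$f" "$(guess_encoding "$f")"
done
```

Note that the last branch cannot say "KOI8-R": any byte sequence is valid in a single-byte encoding, so telling them apart takes the language statistics mentioned above.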

Finally, what makes any of these en_**.utf8 encodings different from another?

Those are locales, not encodings. The encoding specified by most of the locales is UTF-8, but the underlying locale settings are different: number format (decimal dot or comma?), date-time format (Y-m-d or m/d/y? 12 hours or 24 hours?), string collation rules (yes, the way we sort strings depends on the language they are in), etc.
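Collation is the easiest of these to observe from the shell. A small sketch, assuming en_US.UTF-8 is installed on your machine (check with locale -a); the C locale is always there.

```shell
words=$(printf 'b\nA\na\nB')

# The C locale sorts by raw byte value, so uppercase comes first.
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)
echo "$c_order"        # A, B, a, b (one per line)

# Assumption: en_US.UTF-8 exists here. If it does, the same input
# typically comes out as a, A, b, B; if not, sort falls back to C order.
printf '%s\n' "$words" | LC_ALL=en_US.UTF-8 sort
```

Same bytes, same encoding, different order: that part comes from the locale, not from UTF-8.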

Удачи,

In reply to Re: create clone script for utf8 encoding by Anonymous Monk
in thread create clone script for utf8 encoding by Aldebaran
