
No sooner had ikegami made me aware that us-ascii is my default encoding than I started having problems with it. The problems are of the variety where my computer does not behave the way I expect it to.

I'm running Ubuntu with bash, and when I touch a file into existence, it is reported as us-ascii. Likewise, files that are created by redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

Here is a day in the life, where I use this nifty piece of software, translate shell:

$ trans :de -brief "over" >2.ascii.de.txt
$ trans :de -brief "He must." >>2.ascii.de.txt
$ cat 2.ascii.de.txt 
Über
Er muss.
$ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
iconv: illegal input sequence at position 0
$

On STDOUT, I get Ü as the zeroth character. Does ASCII have a representation for Ü?

Then I keep trying to get an iconv command to do something for me, but an effective syntax eludes me. Why is Ü illegal in the iconv command?
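
A minimal sketch of how I could check this myself, assuming the terminal and the file really are UTF-8 (which is what `locale charmap` reports further down). ASCII is a 7-bit code, so any byte above 0x7F cannot be us-ascii. The helper name `count_non_ascii` is made up for illustration, and Ü is written as octal escapes so that the script does not depend on its own encoding:

```shell
#!/bin/bash
# Hypothetical helper (not part of translate shell or iconv):
# count the bytes on stdin that fall outside the 7-bit ASCII range.
count_non_ascii() {
    # LC_ALL=C makes tr work on raw bytes; delete 0x00-0x7F, count the rest.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# "Über" as UTF-8 bytes: \303\234 (0xC3 0x9C) encodes the Ü.
printf '\303\234ber\n' | od -An -tx1

# Number of non-ASCII bytes in each line of the sample file.
printf '\303\234ber\n' | count_non_ascii
printf 'Er muss.\n'    | count_non_ascii
```

If the first count is non-zero, the data was never us-ascii to begin with, which would explain why `iconv -f us-ascii` stops at position 0. Under the same assumption, the file would already be UTF-8 and would need no conversion at all.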

If I'm going to have source that has utf8 characters in it, doesn't it make sense to change the underlying encoding to utf8 or create it that way from the get-go?

After I've touched a file into existence, I use a bash script to clone the next version of a script. All of my scripts follow a naming scheme of a positive integer, followed by a period, followed by a word. The cloned script gets an incremented number, is given execute privileges, and has its name written to a manifest. There isn't anything in it that determines the underlying encoding. I've gotten a lot of mileage out of this script, but I think it's time to replace it with shiny new, lexical Perl. I'll put it in readmore tags for being somewhat OT:

$ cat 2.create.bash 
#!/bin/bash

# which bash version?

echo "The shebang is specifying bash"
if [ -z "${BASH_VERSION}" ]; then
        echo "Not using bash but dash"
else
        echo "Using bash ${BASH_VERSION}"
fi 
#get the first number from $1
#c=$(("$1" : '\([0-9]*\).*$')) didn't work
c=$(expr "$1" : '\([0-9]*\).*$')
echo "$c"
f=$1

#integer addition
d=$(expr "$c" + 1)
echo "$d"

#munge new file, no clobber
t="$d"
q=${f#*.}
s=$t.$q
echo "$s"
cp -n "$f" "$s"
chmod +x "$s"
echo "$s" >> 1.manifest

ls -lh "$s"
gedit "$s" &
$

I'd like to write a Perl equivalent that would give me freedom to choose the underlying encoding. I'd show previous attempts, but they look awful.
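
For reference, the renaming step that any rewrite has to reproduce can be sketched with plain parameter expansion instead of `expr`. This is only a sketch under my own assumptions: the function name `next_name` is invented for illustration, and it assumes the leading number has no leading zeros (shell arithmetic would read those as octal).

```shell
#!/bin/bash
# Hypothetical helper: compute the name of the next clone,
# e.g. 2.create.bash -> 3.create.bash.
next_name() {
    # Require a leading digit and at least one period.
    case $1 in
        [0-9]*.*) ;;
        *) return 1 ;;
    esac
    local num=${1%%.*}    # everything before the first period
    local rest=${1#*.}    # everything after the first period
    # Reject prefixes such as "2x" that are not purely numeric.
    case $num in
        *[!0-9]*) return 1 ;;
    esac
    printf '%s.%s\n' "$((num + 1))" "$rest"
}

next_name 2.create.bash
```

The commented-out `$(( ... ))` attempt in my script fails because arithmetic expansion has no regex-match operator; splitting on the first period sidesteps the match entirely.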

Finally, what makes any of these en_**.utf8 encodings different from one another?

$ locale charmap
UTF-8
$ locale -a
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
ru_RU.utf8
ru_UA.utf8
$ locale -m
ANSI_X3.110-1983
ANSI_X3.4-1968
...
UTF-8
VIDEOTEX-SUPPL
VISCII
WIN-SAMI-2
WINDOWS-31J
$

Thanks for your comment

In reply to create clone script for utf8 encoding by Aldebaran
