@benizi: Yes, my display is utf8, I get 0000000 303 244 \n as output.

@tye: The test cases are condensed versions of a script I'm writing, so the same happens in script form (with 'use utf8' on its own line). In the script the strings were also first assigned to variables, but I didn't check them with unpack, if that is what you meant. I also piped the output into files and then checked the contents of the files.

@graff: I got exactly the same output as you did. I didn't know about the -C flag. Here are the results of further experiments (c3 a4 is the UTF-8 byte sequence for ä, e4 is the Latin-1 byte for ä):

echo ä | od -t x1
c3 a4 0a   <--- utf
perl -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
ä        <--- my input to the <>
c3 a4 0a 20 e4 20 c3 a4 20 c3 a4 0a  <-- utf iso utf utf
perl -CI -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
ä
e4 0a 20 e4 20 c3 a4 20 c3 a4 0a  <-- iso iso utf utf
perl -CO -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
ä
c3 83 c2 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a  <-- utfgarbage utf utfgarbage utfgarbage
perl -CS -e '$f=<>;  print "$f"," \xe4"," \xc3\xa4"," ä\n" ' | od -t x1
c3 a4 0a 20 c3 a4 20 c3 83 c2 a4 20 c3 83 c2 a4 0a   <-- utf utf utfgarbage utfgarbage

This shows that, by default, perl treats strings as iso (Latin-1) and expects iso input and output. The utf8 'ä' in the script is therefore not recognised as one 'ä' but as two iso characters, and consequently can't be uppercased.

Now when I tried the same with 'use utf8;' (to cut a long story short), I found out that it really only changes the interpretation of a literal 'ä' in the script, without changing the internal representation or any IO.

Which means 'use utf8' works correctly, but somewhere there's a documentation and installation deficiency.

It should be documented that 'use utf8;' should not be used on utf8 machines without an additional switch -CS.

Furthermore on utf8 machines -CS should be enabled by default. Otherwise scripts written on iso machines break on utf8 machines and vice versa. I don't fancy changing all my scripts to include the -CS in the first line.

Another thought: If 'use encoding utf8' changes IO formats like the -CS switch does, using it would break backward compatibility with iso machines, which is not that desirable. Ideally the perl interpreter should know where it is running and handle the script accordingly. Which brings us back to the locale, which sadly seems to be ignored at the moment.

In reply to Re: bug in utf8 handling? by jethro
in thread bug in utf8 handling? by jethro
