
perldoc -f uc would tell you: Respects current LC_CTYPE locale if "use locale" in force. Add -Mlocale to your test cases and see if that "fixes" it. By default, Perl doesn't uppercase anything other than a..z, regardless of any utf-8 concerns.

Update: Interesting. My testing confirmed my suspicion (which the fine manual had already confirmed). However, further testing showed that uc (without -Mlocale) appears not to affect accented letters encoded in Latin-1 but does affect the exact same accented letters if they are encoded in utf-8. Perhaps the fine manual could use an update?
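A minimal sketch of that difference, assuming no "use locale" and no "use feature 'unicode_strings'" in effect. The variable names are just for illustration; utf8::upgrade is used here only to get the same character stored internally as utf-8 without depending on how the source file is encoded.

```perl
use strict;
use warnings;

# "\xE9" is e-acute in Latin-1. As written, Perl stores it as a byte string.
my $bytes = "\xE9";

# Same character, but stored internally as utf-8 (UTF8 flag on).
my $chars = "\xE9";
utf8::upgrade($chars);

# %vX prints each character's ordinal in hex.
printf "bytes: %vX -> %vX\n", $bytes, uc $bytes;
printf "chars: %vX -> %vX\n", $chars, uc $chars;
```

I'd expect the first line to show E9 left alone and the second to show E9 becoming C9, which matches what my testing showed.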

I also see now that you mention -Mlocale not having an impact.

So I'll update my guess to be that Perl doesn't know that your string literals are meant to be utf-8 strings and so it is interpreting them as byte strings (or Latin-1 strings, depending on what you want to call Perl's non-utf-8 strings). The "use utf8;" tells Perl that your string literals are utf-8, so it upcases correctly and also translates to Latin-1 when writing to STDOUT (since you haven't declared STDOUT as being a utf-8 file handle).
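Something like the following is what I have in mind for declaring both the source and STDOUT as utf-8. It is only a sketch, and it assumes the script is saved to a file that really is utf-8 encoded; $upper is a name I made up for the example.

```perl
use strict;
use warnings;
use utf8;                            # assumption: this file is saved as utf-8

my $upper = uc "é";                  # one character, U+00E9, upcased to U+00C9

# Without this layer, perl would write the single Latin-1 byte C9.
# With it, STDOUT gets the two-byte utf-8 encoding C3 89.
binmode STDOUT, ':encoding(UTF-8)';
print $upper, "\n";
```

Comment out the binmode line and pipe the output through a hex dumper to see the Latin-1 translation instead.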

encoding.pm notes that it declares STDOUT to be in a specific encoding. However, it also says that it interprets your source code as being in that encoding, not in utf-8. So that leaves me wondering how your source can be correctly interpreted as utf-8 and also correctly interpreted as iso8859. It also makes me curious whether your test cases behave the same if saved to Perl script files instead of being typed into the command line.

I'm also used to believing that Perl's parser reads a line at a time, so a pragma that changes how the source code is read won't have any effect until the next line. So whether or not you have a newline after something like "use utf8;" might make a difference, and since source code entered on the command line isn't read from a file by perl, perhaps that makes a difference as well.

But mostly I'm tired (but not able to get to sleep) so some of my wondering will likely just leave me wondering why I wasn't wandering sooner.

Update: Aha! Your source code is only using single-byte characters, but bytes with the msb set get interpreted differently when those pragmas are used? Actually, I'd expect such "8-bit" characters to cause an error or warning when read after "use utf8;".

You could assign your string literals to variables so that you could determine (in each case) whether Perl interpreted the string as utf-8, in order to eliminate some guessing.
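A rough sketch of that kind of check, using utf8::is_utf8 to look at the internal flag. The describe helper is hypothetical, and the literal is spelled with hex escapes so the result doesn't depend on how the file or command line is encoded; substitute your own literals under each pragma you're testing.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl is storing a string.
sub describe {
    my ($label, $str) = @_;
    printf "%s: length=%d utf8_flag=%s ords=%s\n",
        $label,
        length($str),
        (utf8::is_utf8($str) ? 'on' : 'off'),
        join(' ', map { sprintf '%02X', ord($_) } split //, $str);
}

my $literal = "\xC3\xA9";    # the two bytes that encode e-acute in utf-8
describe('raw bytes', $literal);

my $decoded = $literal;
utf8::decode($decoded);      # interpret those bytes as utf-8, in place
describe('decoded', $decoded);

describe('uc raw', uc $literal);
describe('uc decoded', uc $decoded);
```

If the flag is off and the length is 2, Perl is treating the literal as two separate bytes, and uc has nothing it knows how to upcase.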

- tye

In reply to Re: bug in utf8 handling? (utfm) by tye
in thread bug in utf8 handling? by jethro
