This was executed on Suse 10.0 and Ubuntu 6.06 with perl v5.8.7, completely 'utf8isized'.

Does that mean you were using a utf8-aware xterm window (uxterm, gnuterm, or some such)? If perl really prints utf8 data to a tty that isn't set up to "do the right thing" with utf8 encoded characters, there's no telling what the output might look like.

The sort of problem you're reporting is bound to be some side issue, not perl itself -- e.g. locale settings, as suggested by tye, or the kind of display window you're using, etc. It could also be a misunderstanding about the circumstances that induce perl to print utf8-encoded characters through an output file handle.

I prefer to test these sorts of things with explicit code point values (I rarely try to put literal encoded characters into a script) and explicit encoding layers on the relevant file handle(s) (using either binmode or three-arg open).

If you want to rely on "default behaviors", you do need to experiment heavily on what those behaviors entail, and the experiments will need to include things like the shell environment, the display application, available fonts, ...

For the sake of confirming the behavior of the "uc" function on utf8 strings, I'd try it like this (with a utf8 capable terminal window):

perl -CS -e 'print "a\xe4m\n"' | perl -CS -pe 'print; $_=uc'
For me, that prints two lines: "aäm" followed by "AÄM" (which I am posting here as iso-8859-1 -- if you don't see exactly three letters in each string, with two dots over the middle one, set your browser to use that).

(In the absence of a utf8 display, I'd pipe the output to some other process that would "hexify" the byte stream, so that I could confirm it against a code chart.)


In reply to Re: bug in utf8 handling? by graff
in thread bug in utf8 handling? by jethro
