in reply to Re^2: UTF8/Unicode Confusion
in thread UTF8/Unicode Confusion

Well, the Dump outputs show that the function is correctly returning the unicode character 0xa5; it's just that the internal encoding happens not to be utf8. Using utf8::upgrade gets round whatever problem you're having because it converts the internal representation.

The problem must lie in how you're using the returned value. If for example you're just printing it to STDOUT, and if whatever's listening on STDOUT expects utf8 encoding (eg the terminal), then you need to let Perl know that any output on that file handle should be utf8 encoded, eg

$ perl -e 'print chr 0xa5'|od -x
0000000 00a5
$ perl -e 'binmode(STDOUT, ":utf8"); print chr 0xa5'|od -x
0000000 a5c2
$
See perluniintro (in 5.8.x) for more information.

Dave.

Re^4: UTF8/Unicode Confusion
by jk2addict (Chaplain) on Mar 21, 2005 at 01:09 UTC
    Well, the Dump outputs show that the function is correctly returning the unicode character 0xa5; it's just that the internal encoding happens not to be utf8

    It does? What am I missing about the second dump, the one from 5.8.4?

    -------------- 5.8.4 --------------
    SV = PV(0x44c3d64) at 0x10590f4
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x450ab24 "\245"\0
      CUR = 1
      LEN = 2

    That looks like perl is tossing away half of the bytes long before it returns the value to any output. I don't think it's a problem with how the output is interpreted, just the fact that the output is half as wide as it should be (5.8.4 tossed away the \302).

      Perl is not tossing away half the bytes; perl will store characters either as one byte per character (so the character 0x00A5 is represented as "\245", aka "\xa5"), or in utf8 form, with 1-13 bytes per character (with 0x00A5 represented in two bytes, "\302\245"). Which kind of storage is in use is indicated by the UTF8 flag, which you will see on after the utf8::upgrade and off prior to it.
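      A minimal sketch of that point, assuming a 5.8.1 or later perl where the utf8:: functions are built in. The variable names are only illustrative. It checks the UTF8 flag on a copy of the string before and after utf8::upgrade, and compares the two strings at the character level:

```perl
use strict;
use warnings;

# chr(0xA5) is a single character; perl initially stores it as one byte.
my $s = chr(0xA5);
my $flag_before = utf8::is_utf8($s) ? 1 : 0;   # UTF8 flag is off
my $len_before  = length $s;

# Upgrade a copy: only the internal storage changes, not the value.
my $copy = $s;
utf8::upgrade($copy);
my $flag_after = utf8::is_utf8($copy) ? 1 : 0; # UTF8 flag is now on
my $len_after  = length $copy;

# Character-level comparison ignores the internal representation.
my $same = ($s eq $copy) ? 1 : 0;

printf "UTF8 flag: before=%d after=%d\n", $flag_before, $flag_after;
printf "length:    before=%d after=%d\n", $len_before, $len_after;
printf "equal:     %s\n", $same ? "yes" : "no";
```

      The length stays at one character and the two strings compare equal; only the flag differs.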

      If you have an output filehandle that you want to receive only the utf8 encoding, use binmode as suggested above or perl's -C switch (see perlrun).
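      As a rough illustration of the filehandle side, the sketch below writes the same character to in-memory filehandles with and without a :utf8 layer and reports the bytes that arrive. The helper name bytes_written is made up for this example, and the in-memory handle simply stands in for STDOUT or a file:

```perl
use strict;
use warnings;

# Hypothetical helper: write $str to an in-memory filehandle opened with
# $mode and return the bytes that were written, as a hex string.
sub bytes_written {
    my ($mode, $str) = @_;
    my $buf = '';
    open(my $fh, $mode, \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close($fh) or die "cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $plain    = chr(0xA5);   # stored internally as one byte
my $upgraded = $plain;
utf8::upgrade($upgraded);   # stored internally as utf8

# No layer: the character goes out as the single byte a5.
my $raw_plain    = bytes_written('>', $plain);
my $raw_upgraded = bytes_written('>', $upgraded);

# :utf8 layer: the character goes out as the two bytes c2 a5.
my $utf8_plain    = bytes_written('>:utf8', $plain);
my $utf8_upgraded = bytes_written('>:utf8', $upgraded);

print "no layer: $raw_plain / $raw_upgraded\n";
print ":utf8:    $utf8_plain / $utf8_upgraded\n";
```

      For a character below 0x100 the bytes written depend on the filehandle's layer, not on whether the string happened to be upgraded internally.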

        Well, that was my point. I have no control over where the data comes from (Locale::Currency::Format), nor over where it is going or how it is output (AxKit).

        With those two facts in hand, I fall back to one of my original questions: is the utf8::upgrade solution an acceptable one?