
Well, the Dump outputs show that the function is correctly returning the Unicode character 0xa5; it's just that the internal encoding happens not to be utf8.

It does? What am I missing about the second dump, the one from 5.8.4?

-------------- 5.8.4 --------------
SV = PV(0x44c3d64) at 0x10590f4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x450ab24 "\245"\0
  CUR = 1
  LEN = 2

That looks like perl is tossing away half of the bytes long before it returns the string to any output. I don't think it's a problem with how the output is interpreted, just the fact that the output is half as wide as it should be (5.8.4 tossed away the \302, which is now missing).

Re^5: UTF8/Unicode Confusion
by ysth (Canon) on Mar 21, 2005 at 05:39 UTC
    Perl is not tossing away half the bytes; perl will store characters either as one byte per character (making the character 0x00A5 be represented as "\245" aka "\xa5"), or in utf8 form, with 1-13 bytes per character (with 0x00A5 represented in two bytes, "\302\245"). Which kind of storage is used is indicated by the UTF8 flag, which you will see on after the utf8::upgrade and off prior to it.
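
    A minimal sketch of the point above, assuming the string in question is just "\xa5" (a stand-in for whatever Locale::Currency::Format actually returned). It only uses utf8::upgrade, utf8::is_utf8 and bytes::length to show that the two storage forms hold the same single character:

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Stand-in for the returned string: one character, U+00A5,
# stored internally as the single byte "\245".
my $native   = "\xa5";
my $upgraded = $native;
utf8::upgrade($upgraded);   # changes the internal storage only

# The two strings are the same character string.
print "equal\n" if $native eq $upgraded;
printf "characters: %d and %d\n", length($native), length($upgraded);

# Only the UTF8 flag and the internal byte count differ.
printf "UTF8 flag: %s and %s\n",
    (utf8::is_utf8($native)   ? "on" : "off"),
    (utf8::is_utf8($upgraded) ? "on" : "off");
printf "internal bytes: %d and %d\n",
    bytes::length($native), bytes::length($upgraded);
```

    Both strings report one character; only the second carries the UTF8 flag and occupies two bytes internally.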

    If you have an output filehandle that you want to receive only the utf8 encoding, use binmode as suggested above or perl's -C switch (see perlrun).
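
    As a rough illustration of the binmode route, the sketch below writes to an in-memory filehandle so the resulting bytes can be inspected. The in-memory handle and the choice of the :encoding(UTF-8) layer are assumptions made for the example; for real output you would apply the layer to STDOUT or whatever handle you print to:

```perl
use strict;
use warnings;

my $yen = "\xa5";   # one character, U+00A5, stored as a single byte

# In-memory filehandle, used here only so the output bytes can be examined.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";
binmode $out, ':encoding(UTF-8)';   # encode characters as UTF-8 on output
print {$out} $yen;
close $out or die "cannot close in-memory handle: $!";

# Show what actually reached the handle, in octal as in the Dump output.
my $octal = join ' ', map { sprintf '%03o', ord } split //, $buffer;
print "$octal\n";   # 302 245
```

    The layer does the encoding at output time, so the string itself does not need to be upgraded or otherwise touched first.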

      Well, that was my point. I have no control over where the data comes from (Locale::Currency::Format), nor over where it is going or how it is output (AxKit).

      With those two facts in hand, I fall back to one of my original questions: is the utf8::upgrade solution an acceptable one?

        Well, I have no idea what AxKit is, but if you are feeding the data to it, it should tell you what encoding it wants. utf8::encode() would be one way to force utf8 encoding, yes, but if you are sending the data via a filehandle, applying a utf8 layer to the filehandle would be better. However, if AxKit is a perl module whose functions you are calling and passing data to, it should take your \xa5 whether it is utf8 encoded or not.
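
        To make the difference between the two functions concrete, here is a small sketch (again assuming the data is just "\xa5"). utf8::upgrade leaves the character string unchanged, while utf8::encode replaces it with the UTF-8 byte sequence:

```perl
use strict;
use warnings;

my $upgraded = "\xa5";
utf8::upgrade($upgraded);   # still one character, U+00A5

my $encoded = "\xa5";
utf8::encode($encoded);     # now the two bytes \302 \245

printf "upgraded: %d character(s), UTF8 flag %s\n",
    length($upgraded), (utf8::is_utf8($upgraded) ? "on" : "off");
printf "encoded:  %d character(s), UTF8 flag %s\n",
    length($encoded), (utf8::is_utf8($encoded) ? "on" : "off");

# The encoded string is a different string from the original.
print "encoded differs from \\xa5\n" if $encoded ne "\xa5";
```

        Because the encoded string is a two-byte string in its own right, it suits an interface that expects raw UTF-8 bytes; a perl function that expects character data should be handed the unencoded string.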