Leo_Yao has asked for the wisdom of the Perl Monks concerning the following question:

After some testing, I found that if a variable's content is not encoded in ISO-8859-1, utf8::upgrade does not convert it successfully. Can I say that the conversion rule is from ISO-8859-1 to UTF-8?

Re: What is the convert rule of utf8::upgrade?
by Corion (Patriarch) on Feb 25, 2011 at 18:41 UTC

    utf8 is a core module. Can you suggest how we could rephrase the documentation for utf8::upgrade so that it is clearer about what the function does and where its limitations lie? The current text reads:

    $num_octets = utf8::upgrade($string)

    Converts in-place the internal representation of the string from an octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-X. The logical character sequence itself is unchanged. If $string is already stored as UTF-X, then this is a no-op. Returns the number of octets necessary to represent the string as UTF-X. Can be used to make sure that the UTF-8 flag is on, so that \w or lc() work as Unicode on strings containing characters in the range 0x80-0xFF (on ASCII and derivatives).

    Note that this function does not handle arbitrary encodings. Therefore Encode is recommended for the general purposes; see also Encode.
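
The behaviour described in the quoted documentation can be checked with a short script. This is only a sketch, assuming an ASCII-based platform and a sample string ("caf\xE9") chosen purely for illustration; the variable names are not from the original post. It shows that the character count and the string's value stay the same, that the UTF-8 flag is turned on, and that the return value is the octet count of the new internal representation. The final \w comparison assumes Perl's default regex semantics, with no "use locale" and no unicode_strings feature in effect.

```perl
use strict;
use warnings;

# Illustrative sample: 4 characters, the last one is code point 0xE9.
# A literal like this is normally stored internally as native (Latin-1) octets.
my $s = "caf\xE9";

my $len_before  = length $s;                   # number of characters
my $flag_before = utf8::is_utf8($s) ? 1 : 0;   # internal UTF-8 flag

# Switch the internal storage to Perl's UTF-8-based format, in place.
my $octets = utf8::upgrade($s);

my $len_after  = length $s;
my $flag_after = utf8::is_utf8($s) ? 1 : 0;
my $unchanged  = $s eq "caf\xE9" ? 1 : 0;      # logical string compared by characters

printf "length: %d -> %d\n", $len_before,  $len_after;    # 4 -> 4
printf "flag:   %d -> %d\n", $flag_before, $flag_after;   # 0 -> 1
printf "octets: %d, unchanged: %d\n", $octets, $unchanged;

# With default regex semantics, \w treats 0xE9 differently depending on
# the internal storage; that is the effect the documentation mentions.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
printf "\\w matches native: %s, upgraded: %s\n",
    ($native   =~ /\w/ ? "yes" : "no"),
    ($upgraded =~ /\w/ ? "yes" : "no");
```

Note that utf8::upgrade always interprets the existing octets as native (Latin-1) characters. If the variable actually holds bytes in some other encoding, those bytes are not reinterpreted, which matches the documentation's advice to use Encode in that case.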

Re: What is the convert rule of utf8::upgrade?
by ikegami (Patriarch) on Feb 25, 2011 at 19:09 UTC

    utf8::upgrade never converts to UTF-8. You want utf8::encode for that.

    In fact, utf8::upgrade and utf8::downgrade don't change the string at all. They change its internal storage format ("representation", in the docs). They are useful for working around bugs in Perl and XS modules, but they are not encoding or decoding tools.
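
The distinction can be illustrated with a minimal sketch that compares the two functions on the same one-character string. The variable names are made up for the example. After utf8::upgrade the string is still a single character with code point 0xE9; after utf8::encode it is a different string, made of the two characters 0xC3 and 0xA9, which are the UTF-8 bytes of the original character.

```perl
use strict;
use warnings;

# utf8::upgrade: only the internal storage changes.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
my $upgraded_len = length $upgraded;   # still 1 character
my $upgraded_ord = ord $upgraded;      # still 0xE9

# utf8::encode: the string itself changes, in place,
# to the UTF-8 byte sequence of its characters.
my $encoded = "\xE9";
utf8::encode($encoded);
my $encoded_len = length $encoded;     # 2 characters (bytes): 0xC3 0xA9

# utf8::downgrade reverses the storage change and returns true on success.
my $round_trip = $upgraded;
my $ok = utf8::downgrade($round_trip);

printf "upgraded: length %d, ord 0x%X, equal to original: %s\n",
    $upgraded_len, $upgraded_ord, ($upgraded eq "\xE9" ? "yes" : "no");
printf "encoded:  length %d, bytes %s\n",
    $encoded_len, join(" ", map { sprintf "0x%02X", ord } split //, $encoded);
printf "downgrade ok: %s, flag now: %s\n",
    ($ok ? "yes" : "no"), (utf8::is_utf8($round_trip) ? "on" : "off");
```

Because an upgraded string and its original compare equal with eq, code that only deals in characters should not notice the difference; the storage format only matters at boundaries such as XS code that reads the buffer directly.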