borisz has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

My utf8 data often loses its utf8 flag. This happens when the data is stored in a database or when it is processed by some other module's XS code.

When I then combine that data with other utf8 data, it gets upgraded to utf8 again. As a result I get wrong output, because parts of the new string have been converted to UTF-8 twice.

To overcome this, I mark the data coming from the DB with Encode::_utf8_on. $utf = Encode::decode('utf8', $data) also works, but I would like to avoid the extra copy of $data.
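A minimal sketch of the problem and the two workarounds described above. The sample strings and variable names are made up for illustration; only Encode::_utf8_on and Encode::decode come from the question itself.

```perl
use strict;
use warnings;
use Encode ();

# "\xc3\xa9" is the UTF-8 encoding of U+00E9, as it might come back
# from a database driver with the utf8 flag off (hypothetical sample).
my $from_db = "\xc3\xa9";
my $other   = "\x{263A}";    # a string that already carries the flag

# Concatenation upgrades $from_db byte by byte, so its two bytes
# become two separate characters: the "converted twice" effect.
my $broken = $from_db . $other;

# Workaround 1: switch the flag on in place (internal function).
# The assignment here only exists to keep $from_db for the next step.
my $in_place = $from_db;
Encode::_utf8_on($in_place);

# Workaround 2: decode into a new scalar, at the cost of a copy.
my $copy = Encode::decode('utf8', $from_db);

printf "broken: %d chars, fixed: %d chars\n",
    length $broken, length($in_place . $other);
```

Both workarounds should yield the same one-character string; the difference is only whether a second scalar is created.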

Now to the question: how do you do that? And what is the recommended way?
Boris

Re: Recommended way to set the utf8 flag without altering the data
by bart (Canon) on Jun 27, 2004 at 11:49 UTC
    For perl 5.6.x, which has UTF-8 support but does not ship with Encode, you can use pack. In the templates below, the first character ("C" or "U") selects the type of the packed string, which applies to the whole string; the 0 after it is a character/byte count of zero. "a*" packs the actual string data.
    • To mark the bytes as UTF-8:
      $utf8 = pack 'U0a*', $raw;
    • To mark a UTF-8 string as raw bytes:
      $raw = pack 'C0a*', $utf8;
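A small sketch of the first recipe (bytes to flagged string), using a made-up three-byte sample. Note that perl5100delta documents a change in how pack and unpack treat UTF-8 data, so both recipes, and the 'C0a*' direction in particular, are worth checking on the perl you actually run; this sketch only exercises 'U0a*'.

```perl
use strict;
use warnings;

# UTF-8 bytes of U+20AC (euro sign), utf8 flag off (hypothetical sample).
my $raw = "\xe2\x82\xac";

# 'U0' switches the packed string to UTF-8, 'a*' copies the data as is.
my $utf8 = pack 'U0a*', $raw;

printf "raw: %d chars, packed: %d chars, flag: %s\n",
    length $raw, length $utf8, utf8::is_utf8($utf8) ? 'on' : 'off';
```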

    Actually, that will still work on perl 5.8.x, though the few (well hidden) functions in Encode are a valid alternative. See the section "Messing with Perl's Internals" in the Encode docs. Somehow I get the feeling they don't like you to mess with this yourself...

      That's a good tip too. I compared the three alternatives, and here are the results:

                      Rate  decode    pack  _utf8_on
      decode       57971/s      --    -93%      -95%
      pack        781250/s   1248%      --      -30%
      _utf8_on   1111111/s   1817%     42%        --
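The script behind these numbers was not posted. A comparison of this kind could be sketched with the core Benchmark module roughly as follows; the test string and the sub bodies are guesses, so the figures it prints will not match the ones above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Encode ();

# Hypothetical input: 100 copies of the UTF-8 bytes for U+00E9.
my $data = "\xc3\xa9" x 100;

my %subs = (
    decode   => sub { my $s = Encode::decode('utf8', $data); $s },
    pack     => sub { my $s = pack 'U0a*', $data; $s },
    # The copy keeps $data itself unflagged between iterations.
    _utf8_on => sub { my $s = $data; Encode::_utf8_on($s); $s },
);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%subs);
```

Because the _utf8_on variant also has to copy the string here, the sketch probably understates its advantage for the in-place case.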
      Boris
Re: Recommended way to set the utf8 flag without altering the data
by theorbtwo (Prior) on Jun 26, 2004 at 15:56 UTC

    Um, am I missing something? What's wrong with Encode::_utf8_on?


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      The docs for _utf8_on start with
      _utf8_on(STRING) [INTERNAL] Turns on the UTF-8 flag in STRING. ....
      This implies that this function is not the recommended way, or at least that it may change sometime. Another sign to me that it should be avoided is that its name starts with an underscore, which means 'private function' to me.
      Boris

        Indeed. The docs say, a scant six lines (exact figure will vary depending on renderer and font size, of course) above that, "The following API uses parts of Perl's internals in the current implementation. As such, they are efficient but may change."... but I wouldn't worry too much.

        These are indeed not part of the public API of the module... but what does that mean, really? It means that they may change without notice. When, is the question, though. They clearly won't change without you upgrading the module, or perl itself. This means that you should have fair warning before they change.

        But /will/ they change, even then? They've given you fair warning that they may change it. Will they? I doubt it. First off, Dan Kogai isn't the sort of person (I say, at a guess) to lightly break backwards compatibility -- even when he's given you fair warning that he might do so. The function is documented, even if it warns you in the same breath. But more importantly, I don't foresee a reason for it to change. Perl's unicode handling model is very unlikely to change in the near future such that setting and getting the value of the utf8 flag will no longer be a meaningful thing to do. (Such a change would, in fact, be greatly desirable, but is very unlikely before, at the very least, 5.12.)
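For anyone who would still rather stay off the [INTERNAL] functions, one option not raised in this thread is the pair utf8::is_utf8 and utf8::decode, which are built in from perl 5.8.1 on and documented in the utf8 pragma's docs. A minimal sketch with a made-up sample string; unlike _utf8_on, utf8::decode validates the bytes, so it is not a pure flag flip.

```perl
use strict;
use warnings;

# UTF-8 bytes of U+00E9, utf8 flag off (hypothetical sample).
my $data = "\xc3\xa9";

# Getting the flag.
my $before = utf8::is_utf8($data) ? 1 : 0;

# "Setting" it: decode in place, no second scalar.
# Returns false if the bytes are not valid UTF-8.
my $ok = utf8::decode($data);

my $after = utf8::is_utf8($data) ? 1 : 0;

printf "flag before: %d, decode ok: %s, flag after: %d, length: %d\n",
    $before, ($ok ? 'yes' : 'no'), $after, length $data;
```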

        /me hopes that wasn't too rambly or heretical.

