vsespb has asked for the wisdom of the Perl Monks concerning the following question:

I am going to use Unicode::UTF8

The problem is that I don't really understand how it would be compatible with perl's own Unicode implementation.

The documentation contains the following lines:
Recognizes all noncharacters regardless of Perl version
It's the author's belief that this UTF-8 implementation is conformant with the Unicode Standard Version 6.0

What will happen if some Unicode 6 characters are recognized by this module, but then misinterpreted by perl?

I am interested in understanding this behaviour in at least one possible (even insignificant) edge case (including but not limited to security issues), for all perl 5.8.8+. But I am not sure where to start.

UPD: "at least one" edge case, not "any edge case".

Replies are listed 'Best First'.
Re: Unicode::UTF8 and perl Unicode compatibility
by kcott (Archbishop) on Aug 31, 2013 at 12:21 UTC
Re: Unicode::UTF8 and perl Unicode compatibility
by ww (Archbishop) on Aug 31, 2013 at 11:48 UTC

    1. Start, if possible, by upgrading to a currently supported version of Perl (5.18 is current; ActiveState lags with 5.16; most others known to me are at 5.18).

    2. Try it with cases you anticipate. See if the output -- to file and to console (since the rendering in those two cases may not match) -- satisfies your needs.

    3. Now get all wild and crazy. Use Perl to walk Unicode::UTF8 thru the character sets with which you need to deal. Do it again and again for major versions (but note that you could go insane trying to test every build of every Perl distro, especially those built by individuals from source with variant options enabled).

    4. When complete -- and it should not take long, unless you have an insanely wide selection of needed sets -- you'll know the answer to your question, rather than having to rely on second hand info of whose validity you will likely have no way (other than the above) to evaluate.

    5. Then, share your newfound knowledge with your own reply to this thread.


    If you didn't program your executable by toggling in binary, it wasn't really programming!

      Use Perl to walk Unicode::UTF8 thru the character sets with which you need to deal
      I am not sure how to walk. What can fail? Regexps? Character classes? String comparison? Collation? Case folding? Normalization? I am not sure what Unicode 6 newly introduced.

        "...how to walk.":   loops and arrays.

        "What can fail?....":   Answering that is within the domain of the research plan outlined above... or, stated more simply, 'TITS, try it to see.'

        If I've misconstrued your question or the logic needed to answer it, I offer my apologies to all those electrons which were inconvenienced by the creation of this post.
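
        A minimal sketch of such a walk, using nothing more than a loop. It round-trips each code point in a range through Encode and, only if Unicode::UTF8 happens to be installed, checks that both modules agree. The helper name roundtrip_failures and the sample range are assumptions made for illustration; noncharacters are deliberately left out of the sample range because their handling is the part that differs between perl versions.

```perl
use strict;
use warnings;
use Encode ();

# Unicode::UTF8 is optional here; the comparison is skipped when it is absent.
my $have_uu = eval { require Unicode::UTF8; 1 };

# roundtrip_failures() is a hypothetical helper, not part of any module.
# It returns the code points in [$first, $last] whose encode/decode
# round trip does not give back the original one-character string.
sub roundtrip_failures {
    my ($first, $last) = @_;
    my @failed;
    for my $cp ($first .. $last) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates are never valid in UTF-8
        my $char   = chr $cp;
        my $octets = Encode::encode('UTF-8', $char);
        my $back   = Encode::decode('UTF-8', $octets);
        my $ok     = $back eq $char;
        if ($ok && $have_uu) {
            # Both modules should produce the same octets and the same string.
            $ok = Unicode::UTF8::encode_utf8($char) eq $octets
               && Unicode::UTF8::decode_utf8($octets) eq $char;
        }
        push @failed, $cp unless $ok;
    }
    return @failed;
}

# Sample range only: printable ASCII up to the end of IPA Extensions / Spacing Modifiers.
my @failed = roundtrip_failures(0x20, 0x2FF);
printf "U+%04X\n", $_ for @failed;
print scalar(@failed), " mismatch(es)\n";
```

        Running the same script under each perl you care about, with wider ranges, would show where (if anywhere) the results start to differ.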
Re: Unicode::UTF8 and perl Unicode compatibility
by Hansen (Friar) on Sep 01, 2013 at 18:26 UTC

    You will have no compatibility issues. The main difference between Encode's implementation of UTF-8 and Unicode::UTF8's is that Encode uses decoding/encoding functions provided by perl, whereas Unicode::UTF8 has its own implementation. Unicode::UTF8 provides consistent behavior across all supported (>= 5.8.1) versions of perl.

    I wrote Unicode::UTF8 because I wanted a fast implementation with a simple API; you can read a comparison with Encode.

    -- chansen
      Great! Thanks!
Re: Unicode::UTF8 and perl Unicode compatibility
by Anonymous Monk on Aug 31, 2013 at 12:27 UTC

    The problem is that I don't really understand how it would be compatible with perl's own Unicode implementation.

    Don't worry about it

    What will happen if some Unicode 6 characters are recognized by this module, but then misinterpreted by perl?

    This wouldn't happen. Once octets/bytes are decoded into characters, they're characters (codepoints), whichever decoder produced them.
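
    A small sketch of what "decoded into characters" means, using core Encode so it runs without extra modules (the sample octets are just an illustration):

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xE2\x82\xAC";                    # UTF-8 encoding of U+20AC EURO SIGN
my $chars  = Encode::decode('UTF-8', $octets);  # the decoder turns octets into characters

# Before decoding: three separate bytes.
printf "octets: length %d\n", length $octets;

# After decoding: one character whose code point is 0x20AC. From here on,
# perl's own string operations work on that code point.
printf "chars:  length %d, U+%04X\n", length $chars, ord $chars;
```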

    possible insignificant edge case (including but not limited to security issues), for all perl 5.8.8+. But I am not sure where to start.

    If I were you I wouldn't even start :) Why? Because starting is beginning to sound more and more like reinventing-Unicode::UTF8, or tracking-perl-bugs-since-2006-six-decades-ago

    I would stick with 5.18.x

      This wouldn't happen. Once octets/bytes are decoded into characters, they're characters (codepoints)
      So, you think, once it is decoded to characters, it will work perfectly without breaking anything?

      What about utf8::valid()? Will it pass? Will it affect anything?

      What if I try encoding back to bytes with Encode::encode("UTF-8" .. ?

      I would stick with 5.18.x
      No, I specified in the OP that I need compatibility with any version starting from perl 5.8.8.

        :)

        So, you think, once it is decoded to characters, it will work perfectly without breaking anything?

        Yes, insofar as once Unicode::UTF8 does its thing, it's done; your perl takes over (with all that entails)

        What about utf8::valid()? Will it pass? Will it affect anything?

        I think it will "pass" and will not "affect anything", but I don't see how it matters -- by using Unicode::UTF8 you're saying the hell with Encode.pm / utf8.pm, I'll let Unicode::UTF8 take care of everything, so there should be no reason to consult utf8 or Encode

        What if I try encoding back to bytes with Encode::encode("UTF-8" .. ?

        That ought to work fine as well (call me optimistic)
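
        For anyone who would rather check than be optimistic, a rough sketch: decode some well-formed octets, ask utf8::valid() about the result, and re-encode with Encode::encode. It uses Unicode::UTF8 when the module is installed and falls back to Encode otherwise, so that the script still runs; the sample octets and the fallback are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode ();

# "cafe" with e-acute, a space, and a euro sign, as UTF-8 octets.
my $octets = "caf\xC3\xA9 \xE2\x82\xAC";

# Prefer Unicode::UTF8 when installed; fall back to Encode so the sketch still runs.
my $decode = eval { require Unicode::UTF8; 1 }
    ? \&Unicode::UTF8::decode_utf8
    : sub { Encode::decode('UTF-8', $_[0]) };

my $string = $decode->($octets);

# utf8::valid() checks perl's internal representation of the string.
print "utf8::valid: ", (utf8::valid($string) ? "yes" : "no"), "\n";

# Re-encode with Encode and compare against the octets we started from.
my $again = Encode::encode('UTF-8', $string);
print "round trip:  ", ($again eq $octets ? "same octets" : "different octets"), "\n";
```

        This only covers ordinary, well-formed input; noncharacters and other edge cases would need their own checks on each perl version.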

        No, I specified in the OP that I need compatibility with any version starting from perl 5.8.8.

        Yes, I've read this, I understand, and it's why I didn't make jokes :) Food for thought: Re: Why upgrade perl?, Re: perldeltas - every perl*delta in one file (pod.lst)