in reply to Warning: Unicode bytes!

use bytes is almost never a good idea; in essence, it tells perl to treat any string that it knows is utf8 encoded as if each byte of the encoded form were a separate character. It's a relic of 5.6's failed approach to unicode, IMO. Leave it off, and so long as you have no exposure to any data perl thinks is utf8 encoded, you will have no compatibility problems. The only semi-invisible place utf8 may creep in with newer perls (5.8.1+) is a source file that is UTF-16 encoded, with a proper byte order mark at the beginning; in such a file, literal strings that contain high-bit characters will be encoded as utf8.
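
For what it's worth, here is a minimal sketch of that effect. The names ($smiley, char_length, byte_length) are made up for illustration, and it assumes a string perl has already flagged as utf8 because it contains a character above 0xFF.

```perl
use strict;
use warnings;

# U+263A is above 0xFF, so perl has to store this string internally as
# UTF-8 (three bytes: E2 98 BA) and flags it as utf8.
my $smiley = "\x{263A}";

# Normal semantics: length counts characters.
sub char_length {
    my ($string) = @_;
    return length $string;
}

# Under "use bytes" (lexically scoped to this sub), length counts the
# bytes of perl's internal encoding instead.
sub byte_length {
    my ($string) = @_;
    use bytes;
    return length $string;
}

printf "without use bytes: %d\n", char_length($smiley);   # 1
printf "with use bytes:    %d\n", byte_length($smiley);   # 3
```

A string that perl has not flagged as utf8 gives the same answer from both subs, which is why the pragma buys nothing for plain byte data.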

Re: Re: Warning: Unicode bytes!
by BrowserUk (Patriarch) on Apr 26, 2004 at 00:49 UTC

    With respect, you are wrong! I know what data my scalars contain, and none of it is, nor could ever be, mistakable for unicode data.

    Any determination by perl that IT knows better than I is a guess--and a wrong guess! For perl to guess, against my explicit instruction to the contrary, is also wrong.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      To a human, it couldn't be mistaken. However, bits 1101 1100 1111 0111 (or whatever) may mean AX or it could mean the moon character in Chinese. That's all the contribution I have, regardless of PDF::Template's claim of Unicode compatibility. :-)

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        That's precisely the point. As I said in the OP, the scalars in question contain arbitrary binary data, not text in any form of encoding, hence my use of use bytes, which works admirably well for index, rindex and anything else that doesn't use the regex engine.

        However, the search criteria I need to use lend themselves nicely to a regex, and for most combinations of data and search values it works perfectly. But every now and again I was getting mismatches, and I spent ages investigating how both the data and the search terms were constructed before the realisation dawned.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
      With respect, you are wrong! I know what data my scalars contain, and none of it is, nor could ever be, mistakable for unicode data.
      Then you have no reason to say "use bytes".
      Any determination by perl that IT knows better than I is a guess--and a wrong guess! For perl to guess, against my explicit instruction to the contrary, is also wrong.
      Perl shouldn't guess; it should only flag as utf8 what you have (somehow or other) told it is utf8. use bytes doesn't do what you think; if anything, it will (in the presence of utf8 data) make things worse by exposing you not just to unicode characters but to the bytes that make up their encoding.
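
      A rough sketch of both halves of that, under some assumptions: the names ($binary, $upgraded, match_offset, byte_length) are invented for the example, the octets are just the bit pattern quoted earlier plus two more, and utf8::upgrade stands in for whatever route utf8 data might take into a real program.

      ```perl
      use strict;
      use warnings;

      # Arbitrary binary data built from raw octets. The first two are the
      # bit pattern 1101 1100 1111 0111 mentioned earlier in the thread.
      my $binary = pack 'C*', 0xDC, 0xF7, 0x00, 0xFF;

      # The same four characters, forced into perl's internal UTF-8 form.
      my $upgraded = $binary;
      utf8::upgrade($upgraded);

      # Plain regex search with no pragma: returns the offset of \xFF, or -1.
      sub match_offset {
          my ($data) = @_;
          return $data =~ /\xFF/ ? $-[0] : -1;
      }

      # Length as seen under "use bytes" (lexically scoped to this sub).
      sub byte_length {
          my ($data) = @_;
          use bytes;
          return length $data;
      }

      # pack 'C*' does not set the utf8 flag; only the explicit upgrade does.
      printf "utf8 flag:    %s / %s\n",
          (utf8::is_utf8($binary)   ? 'on' : 'off'),
          (utf8::is_utf8($upgraded) ? 'on' : 'off');          # off / on

      # Without use bytes the two strings are indistinguishable.
      printf "equal:        %s\n", ($binary eq $upgraded ? 'yes' : 'no');   # yes
      printf "length:       %d / %d\n", length($binary), length($upgraded); # 4 / 4
      printf "regex offset: %d / %d\n",
          match_offset($binary), match_offset($upgraded);     # 3 / 3

      # With use bytes the internal encoding of the upgraded copy leaks out.
      printf "byte length:  %d / %d\n",
          byte_length($binary), byte_length($upgraded);       # 4 / 7
      ```

      So for data that was never flagged, the pragma changes nothing, and for data that was, it changes the answers.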