in reply to Re: Re: Warning: Unicode bytes!
in thread Warning: Unicode bytes!

To a human, it couldn't be mistaken. However, the bits 1101 1100 1111 0111 (or whatever) could mean AX, or they could mean the moon character in Chinese. That's all the contribution I have, regardless of PDF::Template's claim of Unicode compatibility. :-)

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: Re: Re: Re: Warning: Unicode bytes!
by BrowserUk (Patriarch) on Apr 26, 2004 at 01:37 UTC

    That's precisely the point. As I said in the OP, the scalars in question contain arbitrary binary data, not text in any form of encoding, hence my use of use bytes, which works admirably well for index, rindex and anything else that doesn't use the regex engine.

    However, the search criteria I need to use lend themselves nicely to a regex, which works perfectly for most combinations of data and search values. But every now and again I was getting mismatches, and I spent ages investigating how both the data and the search terms were constructed before the realisation dawned.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      In that case, the question is "why does the scalar have the utf8 flag set?". If it came from a filehandle, that question is equivalent to "why is the filehandle set to be in a utf8 encoding?". A binmode will probably solve your problems here.
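To illustrate the filehandle point, here is a minimal sketch, not taken from the original poster's code. It reads the same two bytes from an in-memory handle twice: once through an `:encoding(UTF-8)` layer and once after `binmode`. The helper name `slurp` and the sample bytes are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Two raw bytes that also happen to be the UTF-8 encoding of U+00E9.
my $bytes = "\xc3\xa9";

# Hypothetical helper: read everything from an in-memory handle opened
# with the given layer, optionally calling binmode on it first.
sub slurp {
    my ( $layer, $use_binmode ) = @_;
    open my $fh, "<$layer", \$bytes or die "open failed: $!";
    binmode $fh if $use_binmode;
    local $/;                      # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

my $decoded = slurp( ':encoding(UTF-8)', 0 );   # handle decodes the bytes
my $raw     = slurp( '',                 1 );   # handle marked as binary

printf "decoded: length %d, utf8 flag %s\n",
    length $decoded, utf8::is_utf8($decoded) ? 'on' : 'off';
printf "raw:     length %d, utf8 flag %s\n",
    length $raw, utf8::is_utf8($raw) ? 'on' : 'off';
```

The scalar read through the encoding layer comes back as one character with the utf8 flag on, while the one read after `binmode` stays as two plain bytes.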

      If it didn't come from a filehandle, or marking the FH as having binary data is not a good thing, you can use $wasutf8 = Encode::_utf8_off($string);
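A small sketch of what that call does, assuming a string that picked up the utf8 flag by being decoded. Note that the Encode documentation marks `_utf8_off` as an internal function, so treat this as an illustration rather than a recommendation. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Build a string that carries the utf8 flag by decoding two UTF-8 bytes.
my $string = Encode::decode( 'UTF-8', "\xc3\xa9" );
my $length_before = length $string;             # counted in characters

# Turn the flag off. The underlying bytes are untouched; the return
# value is the previous state of the flag, not a success indicator.
my $was_utf8 = Encode::_utf8_off($string);
my $length_after = length $string;              # now counted in bytes

printf "flag was %s, length went from %d to %d\n",
    $was_utf8 ? 'on' : 'off', $length_before, $length_after;
```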

      In case you wondered, my general rule is that giving the runtime more information about what's going on (by making sure the utf8 bit is set correctly on scalars, or the encoding is set correctly on filehandles) is better than forcing it to do what you want when it has other ideas (by using bytes).

      (That's not a hard and fast rule, of course...)


      Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      Perhaps you could provide an actual example? I don't think I understand the situation you refer to. I was assuming, perhaps erroneously, that you are running at least 5.8.1; I don't know as much about the utf8-related bugs that afflicted regex matching before then. Even there, I can't imagine the situation you seem to describe unless you actually are introducing utf8 data.

        Finding a reasonably compact example to demonstrate the problem has (I think) allowed me to clarify where the problem lies.

        Perl -v == 5.8.3, though 5.8.2 and 5.8.1 also suffer the same problem.

        The scalars I am searching contain packed binary (numeric) data. To avoid the need to unpack large volumes of data before looking to see if certain values exist within the scalar, I was converting the search value to its binary representation and then searching for that using index. Obviously, to avoid mismatches, once the search term is located, it is necessary to check that the match occurred at a boundary appropriate to the size of the packed elements. E.g. if the scalar contains 0x00ff, 0xff00 & 0xffff in that (non-byte-swapped) order, then when searching for 0xffff, you have to check that the position at which it is found is word-aligned, in order not to fall for the false hit at the zero-based index position 1.

        $s = x'00ffff00ffff';
        #       [ffff]          a non-word-aligned false hit
        #             [ffff]    word-aligned true hit

        However, testing for alignment using index requires putting the search into a loop to skip over false hits and detect later true ones. It's also necessary to avoid arbitrary abutments of binary bytes being misinterpreted as Unicode data, hence the use of use bytes;
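The loop described above might look something like the following. This is a sketch under assumptions, not the poster's actual code: the data is the three-word example from the post packed big-endian with `pack 'n*'`, and the variable names are invented for the illustration.

```perl
use strict;
use warnings;
use bytes;

# The example data from the post: 0x00ff, 0xff00, 0xffff, packed as
# 16-bit big-endian values (an assumption about the real packing).
my $s      = pack 'n*', 0x00ff, 0xff00, 0xffff;
my $needle = pack 'n', 0xffff;
my $width  = length $needle;        # 2 bytes per packed element

my @hits;
my $pos = 0;
while ( ( $pos = index( $s, $needle, $pos ) ) >= 0 ) {
    if ( $pos % $width == 0 ) {
        push @hits, $pos;           # aligned: a true hit
        $pos += $width;             # resume after this element
    }
    else {
        $pos++;                     # misaligned: skip the false hit
    }
}

print "aligned hits at byte offsets: @hits\n";
```

With this data the false hit at offset 1 is skipped and only the aligned hit at offset 4 is recorded.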

        It struck me that rather than have a loop to test the alignment and continue after misaligned matches, I could move the search and alignment test into the regex engine.

        print "'$1' found at ", pos( $s ) - 4 while $s =~ m[(?<=^.{4})*?(....)]g;

        'the ' found at 0
        'quic' found at 4
        'k br' found at 8
        'own ' found at 12
        'fox ' found at 16
        'jump' found at 20
        's ov' found at 24
        'er t' found at 28
        'he l' found at 32
        'azy ' found at 36

        Unfortunately, it seems that use bytes is not honoured by the regex engine, at least as far as the numbers in repetition modifiers are concerned.
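For comparison, here is one possible way to keep the alignment test inside the regex engine. It is not the pattern from the post and makes no claim to fix the `use bytes` behaviour described there. It assumes the scalar does not carry the utf8 flag (checked explicitly) and walks the data in two-byte steps from `\G`, so a match can only start on an even offset.

```perl
use strict;
use warnings;

# Same assumed example data: 0x00ff, 0xff00, 0xffff as big-endian words.
my $s = pack 'n*', 0x00ff, 0xff00, 0xffff;

# This sketch relies on the scalar being plain bytes.
die "scalar unexpectedly has the utf8 flag set" if utf8::is_utf8($s);

my @hits;
# \G anchors each attempt where the previous match ended (offset 0 at
# first); (?:..)*? skips whole two-byte elements; /s lets . match any
# byte, including "\n".
while ( $s =~ /\G(?:..)*?(\xff\xff)/gs ) {
    push @hits, $-[1];              # start offset of the captured word
}

print "aligned hits at byte offsets: @hits\n";
```

Because every match ends on an element boundary, the next attempt starts aligned as well, so no explicit offset arithmetic is needed.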

