in reply to Re: Re: Re: Re: Re: Re: Warning: Unicode bytes!
in thread Warning: Unicode bytes!

Good, finally some code. :)

The problem lies in the first part of your regexp:

m[ (?<= ^ .{4} )*? ... ]x

The (?<= ... ) construct is a zero-width assertion: it consumes no text, so matching it "zero or more times" is the same as matching it zero or one times. In all of your cases it is matching zero times, that is, at a point where the text does not follow (beginning of string followed by 4 characters (bytes)).

I'm not sure why you turned off warnings within the block, but the "matches null string many times" warning was a (not very helpful) indication of this.

In any case, you cannot use variable-width matches inside a lookbehind, so if you want to stick with this approach, I would suggest something like this:

if ($bindata =~ m[ ^ (?: .{4} )*? \Q$bin\E ]xs) {
    # /s lets . match newline bytes; pos() is only set under //g, so use $+[0] (end of match)
    print "regex found $n at ", $+[0] - length( $bin );
}
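
To make the alignment behaviour concrete, here is a small self-contained sketch of the same approach. The helper name find_aligned and the sample data are made up for illustration, and it reads the match end from $+[0] rather than pos(), on the assumption that the match is done without //g:

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the first occurrence of $needle in $data
# that starts on a 4-byte boundary, or undef (empty list) if there is none.
sub find_aligned {
    my ($data, $needle) = @_;
    # /s lets . match newline bytes too; /x permits the spacing in the pattern.
    return unless $data =~ m[ ^ (?: .{4} )*? \Q$needle\E ]xs;
    # $+[0] is the offset just past the end of the whole match.
    return $+[0] - length($needle);
}

# Made-up sample: "ABCD" occurs at offset 1 (unaligned) and offset 8 (aligned).
my $data    = "xABCDxxxABCD";
my $plain   = index($data, "ABCD");          # first occurrence at any offset
my $aligned = find_aligned($data, "ABCD");   # first occurrence on a 4-byte boundary
print "index: $plain, aligned: $aligned\n";
```

The non-greedy (?: .{4} )*? steps forward one 4-byte group at a time, so the literal is only ever tried at offsets 0, 4, 8 and so on, which is what the lookbehind was meant to enforce.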

Hugo

Re: Re^7: Warning: Unicode bytes!
by Anomynous Monk (Scribe) on Apr 26, 2004 at 15:44 UTC
    Hugo, I'm curious to know if you can think of any reason to use bytes in 5.8.4 and onward?

    My understanding is that utf8 is treated like tainted data: if you don't introduce any, it won't rear its ugly head.

      I don't find it that hard to come up with cases where I'd want to look at the bytes used to represent some UTF-8 string. Probably these could be done by unsetting the UTF-8 bit on the string (or on a copy of it), but there being more than one way is Perlish.

      For example, I might just want to know the storage size of a UTF-8 string. Perhaps I have an algorithm that compresses using the concepts of bytes but I want it to "just work" when given a string, whether it is UTF-8 or not. Perhaps I want to transmit a UTF-8 string over a system that has problems with some specific bytes and I want to check for those bytes. Perhaps I want to uuencode a UTF-8 string. Perhaps I need to compute a byte-based checksum of a UTF-8 string.
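
A rough sketch of the first and last of those cases (storage size and a byte-based checksum), in case it helps. The sample string is made up, and the sketch uses Encode::encode_utf8 next to bytes::length purely for comparison; neither is claimed to be the one right way:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use bytes ();    # loads bytes::length without switching on byte semantics

# Made-up sample: "cafe" with e-acute, a space, and U+263A; 6 characters.
my $str = "caf\x{e9} \x{263A}";

my $chars  = length $str;            # counts characters
my $octets = encode_utf8($str);      # explicit UTF-8 encoded byte string
my $bytes  = length $octets;         # 3 + 2 + 1 + 3 bytes
my $stored = bytes::length($str);    # size of perl's internal representation

# Simple 16-bit byte-sum checksum over the encoded bytes.
my $checksum = unpack("%16C*", $octets);

printf "chars=%d bytes=%d stored=%d checksum=%d\n",
    $chars, $bytes, $stored, $checksum;
```

One caveat: bytes::length reports whatever perl happens to store internally, so for a string that has not been upgraded to UTF-8 it can differ from the length of the explicitly encoded copy.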

      - tye        

      I certainly wouldn't rule it out: I can imagine there are times you'd want to peek at the internal encoding of a string. However, the only reason I can think of off the top of my head is to investigate a problem that you think may be a bug in perl, and there are other hammers I'd usually grab first in such cases (such as Devel::Peek, perl -Dxxx and the hammer-of-hammers gdb).
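
For anyone who hasn't used the first of those hammers, a minimal sketch follows. The sample string is made up, and the utf8::is_utf8 check is only there to show the same flag programmatically:

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Made-up sample: a single character above 0xFF, so perl stores it as UTF-8.
my $wide = "\x{263A}";

# Dump writes the scalar's flags and raw PV bytes to STDERR.
Dump($wide);

# The same information as a boolean: is the internal UTF-8 flag set?
my $flagged = utf8::is_utf8($wide) ? 1 : 0;
print "UTF-8 flag: $flagged\n";
```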

      Hugo

Re: Re^7: Warning: Unicode bytes!
by BrowserUk (Patriarch) on Apr 26, 2004 at 20:01 UTC

    Thanks Hugo. That indeed fixes my problem, with the emphasis on "my".


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail