Re: Re: Re: Re: Re: Warning: Unicode bytes!

Re: Re: Re: Re: Re: Re: Warning: Unicode bytes! by BrowserUk (Patriarch) on Apr 26, 2004 at 09:01 UTC
Finding a reasonably compact example to demonstrate the problem has (I think) allowed me to clarify where the problem lies. This is Perl 5.8.3, though 5.8.2 and 5.8.1 also suffer the same problem.

The scalars I am searching contain packed binary (numeric) data. To avoid the need to unpack large volumes of data before looking to see if certain values exist within the scalar, I was converting the search value to its binary representation and then searching for that using `index`. Obviously, to avoid mismatches, once the search term is located, it is necessary to check that the match occurred at a boundary appropriate to the size of the packed elements.

E.g. if the scalar contains 0x00ff, 0xff00 & 0xffff in that (non-byte-swapped) order, then when searching for 0xffff, you have to check that the position at which it is found is word-aligned, in order not to fall for the false hit at the zero-based index position 1. `$s = x'00ffff00ffff';` The `ffff` at byte offset 1 is a non-word-aligned false hit; the `ffff` at byte offset 4 is the word-aligned true hit.

However, testing for alignment using `index` requires putting the search into a loop to skip over false hits and detect later true ones. It's also necessary to avoid arbitrary abutments of binary bytes being misinterpreted as Unicode data, hence the `use bytes;`.

It struck me that rather than have a loop to test the alignment and continue after misaligned matches, I could move the search and alignment test into the regex engine. `print "'$1' found at ", pos( $s ) - 4 while $s =~ m[(?<=^.{4})*?(....)]g;` Output: `'the ' found at 0 'quic' found at 4 'k br' found at 8 'own ' found at 12 'fox ' found at 16 'jump' found at 20 's ov' found at 24 'er t' found at 28 'he l' found at 32 'azy ' found at 36`

Unfortunately, it seems that `use bytes` is not honoured by the regex engine, at least as far as the numbers in repetition modifiers are concerned.

Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
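For readers following along, the `index` loop with an alignment check described in the post might look roughly like the sketch below. The helper name `aligned_index`, its arguments, and the use of `pack 'n*'` to build the sample data are assumptions made for illustration; they are not taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return every byte
# offset at which $needle occurs in $haystack on a multiple of $width.
sub aligned_index {
    my( $haystack, $needle, $width ) = @_;
    my @hits;
    my $pos = 0;
    while( ( $pos = index( $haystack, $needle, $pos ) ) >= 0 ) {
        # Keep only hits that start on an element boundary.
        push @hits, $pos unless $pos % $width;
        # Resume one byte further on so later hits are still seen.
        $pos++;
    }
    return @hits;
}

# The example from the post: 0x00ff, 0xff00, 0xffff as big-endian words,
# i.e. the bytes 00 ff ff 00 ff ff.
my $s    = pack 'n*', 0x00ff, 0xff00, 0xffff;
my @hits = aligned_index( $s, pack( 'n', 0xffff ), 2 );
print "aligned hits at: @hits\n";    # prints "aligned hits at: 4"
```

The hit at offset 1 is found by `index` but discarded by the modulus test, which is exactly the bookkeeping the regex attempt was meant to avoid.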
Re^7: Warning: Unicode bytes! by hv (Prior) on Apr 26, 2004 at 09:38 UTC
Good, finally some code. :)

The problem lies in the first part of your regexp: `m[ (?<= ^ .{4} )*? ... ]x`

The `(?<= ... )` construct is an assertion, so matching it "zero or more times" is the same as matching it zero or one times, and in all these cases it is matching zero times: that is, the text doesn't follow (beginning of string followed by 4 characters (bytes)). I'm not sure why you turned off warnings within the block, but the "matches null string many times" warning was a (not very helpful) indication of this.

In any case, you cannot use variable-width matches inside a lookbehind, so if you want to stick with this approach, I would suggest something like this: `if ($bindata =~ m[ ^ (?: .{4} )*? \Q$bin\E ]x) { print "regex found $n at ", pos( $bindata ) - length( $bin ); }`

Hugo
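A self-contained sketch of the anchored, fixed-stride regex suggested above follows. It departs from the snippet in the post in a few ways, all of them assumptions for illustration: the wrapper `aligned_match` is a hypothetical name, the stride is taken from the length of the search term instead of being hard-coded, `/s` is added so that `.` also matches a 0x0a byte in binary data, and the offset is read from `@-` because `pos()` is only set by `/g` matches.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not from the original post): return the byte
# offset of the first element-aligned occurrence of $bin in $bindata,
# or -1 if there is none.
sub aligned_match {
    my( $bindata, $bin ) = @_;
    my $width = length $bin;
    # \A anchors at the start; (?: .{$width} )*? steps over whole
    # elements, fewest first, so the capture can only begin on a boundary.
    return $bindata =~ m[ \A (?: .{$width} )*? ( \Q$bin\E ) ]xs
        ? $-[1]     # start offset of the captured search term
        : -1;
}

my $bindata = pack 'n*', 0x00ff, 0xff00, 0xffff;   # 00 ff ff 00 ff ff
my $bin     = pack 'n', 0xffff;
print "regex found 0xffff at ", aligned_match( $bindata, $bin ), "\n";
# prints "regex found 0xffff at 4"
```

Because the skipped prefix is always a whole number of elements, the misaligned `ffff` at offset 1 is never a candidate, so no separate alignment test is needed.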
Re: Re^7: Warning: Unicode bytes! by Anomynous Monk (Scribe) on Apr 26, 2004 at 15:44 UTC
Hugo, I'm curious to know if you can think of any reason to `use bytes` in 5.8.4 and onward? My understanding is that utf8 is treated like tainted data: if you don't introduce any, it won't rear its ugly head.
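The "tainted data" analogy can be checked with a small experiment using the core `utf8::is_utf8` function, as in the sketch below. The variable names and sample values are assumptions chosen for illustration, and the sketch only shows flag propagation; it does not settle whether `use bytes` is ever needed.

```perl
use strict;
use warnings;

# Packed binary data starts out as a plain byte string.
my $packed = pack 'n*', 0x00ff, 0xffff;            # 4 bytes
print "packed: ", ( utf8::is_utf8( $packed ) ? "utf8" : "bytes" ), "\n";
# prints "packed: bytes"

# Concatenating a wide character "taints" the result: the whole string
# is upgraded and now carries the UTF-8 flag.
my $mixed = $packed . "\x{263A}";
print "mixed:  ", ( utf8::is_utf8( $mixed ) ? "utf8" : "bytes" ), "\n";
# prints "mixed:  utf8"

# Character length versus the length of the internal encoding.
my $chars = length $mixed;
my $octets;
{
    use bytes;
    $octets = length $mixed;
}
print "chars=$chars octets=$octets\n";             # prints "chars=5 octets=10"
```

As long as nothing like the second step happens, the packed data stays a byte string and character offsets equal byte offsets.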
Re^9: Warning: Unicode bytes! by tye (Sage) on Apr 26, 2004 at 17:46 UTC
Re^9: Warning: Unicode bytes! by hv (Prior) on Apr 26, 2004 at 15:56 UTC
Re: Re^7: Warning: Unicode bytes! by BrowserUk (Patriarch) on Apr 26, 2004 at 20:01 UTC
Thanks Hugo. That indeed fixes my problem, with the emphasis on "my".