Finding a reasonably compact example to demonstrate the problem has (I think) allowed me to clarify where the problem lies.
This is perl 5.8.3, though 5.8.1 and 5.8.2 suffer from the same problem.
The scalars I am searching contain packed binary (numeric) data. To avoid the need to unpack large volumes of data before looking to see if certain values exist within the scalar, I was converting the search value to its binary representation and then searching for that using index. Obviously, to avoid mismatches, once the search term is located, it is necessary to check that the match occurred at a boundary appropriate to the size of the packed elements. E.g. if the scalar contains 0x00ff, 0xff00 & 0xffff in that (non-byte-swapped) order, then when searching for 0xffff, you have to check that the position at which it is found is word-aligned, in order not to fall for the false hit at the zero-based index position 1.
$s = x'00ffff00ffff';
#       [ffff]        this is a non-word-aligned false hit
#             [ffff]  word-aligned true hit
However, testing for alignment using index requires putting the search into a loop to skip over false hits and detect later true ones. It's also necessary to avoid arbitrary abutments of binary bytes being misinterpreted as Unicode data, hence the use of use bytes;
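The index loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from my real application: the sub name aligned_index and its $width parameter are made up for the example, and it assumes the scalars are plain byte strings as produced by pack.

```perl
use strict;
use warnings;

# Hypothetical helper: return every offset at which $needle occurs in
# $haystack on a boundary that is a multiple of $width bytes.
sub aligned_index {
    my ( $haystack, $needle, $width ) = @_;
    my @hits;
    my $p = -1;
    while ( ( $p = index( $haystack, $needle, $p + 1 ) ) >= 0 ) {
        push @hits, $p unless $p % $width;    # skip misaligned (false) hits
    }
    return @hits;
}

# The 16-bit values 0x00ff, 0xff00, 0xffff packed big-endian.
my $s    = pack 'n*', 0x00ff, 0xff00, 0xffff;
my @hits = aligned_index( $s, pack( 'n', 0xffff ), 2 );
print "aligned hits: @hits\n";    # the false hit at offset 1 is skipped
```

Restarting the search at $p + 1 rather than $p + $width means a misaligned hit cannot hide an aligned one that overlaps it.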
It struck me that rather than have a loop to test the alignment and continue after misaligned matches, I could move the search and alignment test into the regex engine.
print "'$1' found at ", pos( $s ) - 4 while $s =~ m[(?<=^.{4})*?(....)]g;

'the ' found at 0
'quic' found at 4
'k br' found at 8
'own ' found at 12
'fox ' found at 16
'jump' found at 20
's ov' found at 24
'er t' found at 28
'he l' found at 32
'azy ' found at 36
Unfortunately, it seems that use bytes is not honoured by the regex engine, at least as far as the numbers in repetition modifiers are concerned.
#! perl -slw
use strict;
use bytes;

my $bindata = pack 'N*', 0000 .. 4000;

for my $n ( 16_000_000 .. 17_000_000 ) {
    no warnings;
    my $bin = pack 'N', $n;
    my $p = -1;
    while( ( $p = index( $bindata, $bin, $p+1 ) ) >= 0 ) {
        if( not $p % 4 ) {
            print "\nindex found $n at $p";
        }
        else{
            print "\nindex found $n at (non % 4 == 0) $p";
        }
    }
    if( $bindata =~ m[(?<=^.{4})*?\Q$bin\E]g ) {
        print "regex found $n at ", pos( $bindata ) - length( $bin );
    }
}
__END__
P:\test>test2
index found 16056320 at (non % 4 == 0) 982
regex found 16056320 at 982
index found 16121856 at (non % 4 == 0) 986
regex found 16121856 at 986
index found 16187392 at (non % 4 == 0) 990
regex found 16187392 at 990
index found 16252928 at (non % 4 == 0) 994
regex found 16252928 at 994
index found 16318464 at (non % 4 == 0) 998
regex found 16318464 at 998
index found 16384000 at (non % 4 == 0) 1002
regex found 16384000 at 1002
index found 16449536 at (non % 4 == 0) 1006
regex found 16449536 at 1006
index found 16515072 at (non % 4 == 0) 1010
regex found 16515072 at 1010
index found 16580608 at (non % 4 == 0) 1014
regex found 16580608 at 1014
index found 16646144 at (non % 4 == 0) 1018
regex found 16646144 at 1018
index found 16711680 at (non % 4 == 0) 1022
regex found 16711680 at 1022
index found 16777216 at (non % 4 == 0) 7
index found 16777216 at (non % 4 == 0) 1026
regex found 16777216 at 7
index found 16777217 at (non % 4 == 0) 1031
regex found 16777217 at 1031
index found 16777218 at (non % 4 == 0) 2055
regex found 16777218 at 2055
index found 16777219 at (non % 4 == 0) 3079
regex found 16777219 at 3079
index found 16777220 at (non % 4 == 0) 4103
regex found 16777220 at 4103
index found 16777221 at (non % 4 == 0) 5127
regex found 16777221 at 5127
index found 16777222 at (non % 4 == 0) 6151
regex found 16777222 at 6151
index found 16777223 at (non % 4 == 0) 7175
regex found 16777223 at 7175
index found 16777224 at (non % 4 == 0) 8199
regex found 16777224 at 8199
index found 16777225 at (non % 4 == 0) 9223
regex found 16777225 at 9223
index found 16777226 at (non % 4 == 0) 10247
regex found 16777226 at 10247
index found 16777227 at (non % 4 == 0) 11271
regex found 16777227 at 11271
index found 16777228 at (non % 4 == 0) 12295
regex found 16777228 at 12295
index found 16777229 at (non % 4 == 0) 13319
regex found 16777229 at 13319
index found 16777230 at (non % 4 == 0) 14343
regex found 16777230 at 14343
index found 16777231 at (non % 4 == 0) 15367
regex found 16777231 at 15367
index found 16842752 at (non % 4 == 0) 1030
regex found 16842752 at 1030
index found 16908288 at (non % 4 == 0) 1034
regex found 16908288 at 1034
index found 16973824 at (non % 4 == 0) 1038
regex found 16973824 at 1038
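For comparison, here is a sketch of one way the alignment test might be kept inside the regex engine. It is a different pattern from the one in the test script above, not a fix for it: it anchors with \G and steps over whole 4-byte elements with a non-greedy group. The sub name aligned_regex is made up for the example, and it assumes byte strings and a needle exactly 4 bytes long (so that pos stays aligned after each match). The /s modifier is there because, without it, . does not match a byte with the value 0x0A.

```perl
use strict;
use warnings;

# Hypothetical helper: return every 4-byte-aligned offset of $needle in
# $haystack. \G anchors each attempt where the previous match ended, and
# (?:.{4})*? steps forward one whole 4-byte element at a time.
sub aligned_regex {
    my ( $haystack, $needle ) = @_;    # $needle is assumed to be 4 bytes
    my @hits;
    while ( $haystack =~ /\G(?:.{4})*?\Q$needle\E/gs ) {
        push @hits, pos( $haystack ) - length $needle;
    }
    return @hits;
}

my $bindata = pack 'N*', 0 .. 4000;
my @hits    = aligned_regex( $bindata, pack( 'N', 1000 ) );
print "regex found 1000 at @hits\n";    # 1000 is element 1000, i.e. byte offset 4000
```

Because each match ends on an element boundary, the next \G attempt starts on one too, so misaligned occurrences are never considered.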
In reply to Re: Re: Re: Re: Re: Re: Warning: Unicode bytes!
by BrowserUk
in thread Warning: Unicode bytes!
by BrowserUk