Re: Re: Re: Re: Re: Re: Warning: Unicode bytes!

Finding a reasonably compact example to demonstrate the problem has (I think) allowed me to clarify where the problem lies.

Perl -v == 5.8.3, though 5.8.2 and 5.8.1 also suffer the same problem.

The scalars I am searching contain packed binary (numeric) data. To avoid the need to unpack large volumes of data before looking to see if certain values exist within the scalar, I was converting the search value to its binary representation and then searching for that using index. Obviously, to avoid mismatches, once the search term is located, it is necessary to check that the match occurred at a boundary appropriate to the size of the packed elements. E.g. if the scalar contains 0x00ff, 0xff00 & 0xffff in that (non-byte-swapped) order, then when searching for 0xffff, you have to check that the position at which it is found is word-aligned, so as not to fall for the false hit at zero-based index position 1.

$s = pack 'H*', '00ffff00ffff';
#                 [ffff] this is a non-word-aligned false hit
#                       [ffff] word-aligned true hit

However, testing for alignment using index requires putting the search into a loop to skip over false hits and detect later true ones. It's also necessary to avoid arbitrary abutments of binary bytes being misinterpreted as Unicode data, hence the use of use bytes;
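A minimal sketch of that index loop, assuming 16-bit big-endian elements as in the 0x00ff / 0xff00 / 0xffff example above. The helper name aligned_index and its $width parameter are made up for illustration; they are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: return every byte offset at which $needle occurs in
# $haystack on a boundary that is a multiple of $width.
sub aligned_index {
    my( $haystack, $needle, $width ) = @_;
    my @hits;
    my $p = -1;
    # Resume one byte past each hit so later (possibly aligned) hits are seen.
    while( ( $p = index( $haystack, $needle, $p + 1 ) ) >= 0 ) {
        push @hits, $p unless $p % $width;    # keep aligned hits only
    }
    return @hits;
}

# Three 16-bit words: 0x00ff, 0xff00, 0xffff (big-endian, not byte-swapped).
my $s = pack 'n*', 0x00ff, 0xff00, 0xffff;

my @hits = aligned_index( $s, pack( 'n', 0xffff ), 2 );
print "aligned hits at: @hits\n";    # prints "aligned hits at: 4"
```

The misaligned hit at offset 1 is found by index but discarded by the modulus test; only the word-aligned hit at offset 4 is reported.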

It struck me that rather than have a loop to test the alignment and continue after misaligned matches, I could move the search and alignment test into the regex engine.

print "'$1' found at ", pos( $s ) - 4 
    while $s =~ m[(?<=^.{4})*?(....)]g;

'the ' found at  0
'quic' found at  4
'k br' found at  8
'own ' found at  12
'fox ' found at  16
'jump' found at  20
's ov' found at  24
'er t' found at  28
'he l' found at  32
'azy ' found at  36

Unfortunately, it seems that use bytes is not honoured by the regex engine, at least as far as the numbers in repetition modifiers are concerned.

#! perl -slw
use strict;
use bytes;

my $bindata = pack 'N*', 0000 .. 4000;

for my $n ( 16_000_000 .. 17_000_000 ) {
    no warnings;
    
    my $bin = pack 'N', $n;
    
    my $p = -1;
    while( ( $p = index( $bindata, $bin, $p+1 ) ) >= 0 ) {
        if( not $p % 4 ) {
            print "\nindex found $n at $p";
        }
        else{
            print "\nindex found $n at (non % 4 == 0) $p";
        }
    }
    
    if( $bindata =~ m[(?<=^.{4})*?\Q$bin\E]g ) {
        print "regex found $n at ", pos( $bindata ) - length( $bin );
    }
}

__END__
P:\test>test2

index found 16056320 at (non % 4 == 0) 982
regex found 16056320 at 982

index found 16121856 at (non % 4 == 0) 986
regex found 16121856 at 986

index found 16187392 at (non % 4 == 0) 990
regex found 16187392 at 990

index found 16252928 at (non % 4 == 0) 994
regex found 16252928 at 994

index found 16318464 at (non % 4 == 0) 998
regex found 16318464 at 998

index found 16384000 at (non % 4 == 0) 1002
regex found 16384000 at 1002

index found 16449536 at (non % 4 == 0) 1006
regex found 16449536 at 1006

index found 16515072 at (non % 4 == 0) 1010
regex found 16515072 at 1010

index found 16580608 at (non % 4 == 0) 1014
regex found 16580608 at 1014

index found 16646144 at (non % 4 == 0) 1018
regex found 16646144 at 1018

index found 16711680 at (non % 4 == 0) 1022
regex found 16711680 at 1022

index found 16777216 at (non % 4 == 0) 7

index found 16777216 at (non % 4 == 0) 1026
regex found 16777216 at 7

index found 16777217 at (non % 4 == 0) 1031
regex found 16777217 at 1031

index found 16777218 at (non % 4 == 0) 2055
regex found 16777218 at 2055

index found 16777219 at (non % 4 == 0) 3079
regex found 16777219 at 3079

index found 16777220 at (non % 4 == 0) 4103
regex found 16777220 at 4103

index found 16777221 at (non % 4 == 0) 5127
regex found 16777221 at 5127

index found 16777222 at (non % 4 == 0) 6151
regex found 16777222 at 6151

index found 16777223 at (non % 4 == 0) 7175
regex found 16777223 at 7175

index found 16777224 at (non % 4 == 0) 8199
regex found 16777224 at 8199

index found 16777225 at (non % 4 == 0) 9223
regex found 16777225 at 9223

index found 16777226 at (non % 4 == 0) 10247
regex found 16777226 at 10247

index found 16777227 at (non % 4 == 0) 11271
regex found 16777227 at 11271

index found 16777228 at (non % 4 == 0) 12295
regex found 16777228 at 12295

index found 16777229 at (non % 4 == 0) 13319
regex found 16777229 at 13319

index found 16777230 at (non % 4 == 0) 14343
regex found 16777230 at 14343

index found 16777231 at (non % 4 == 0) 15367
regex found 16777231 at 15367

index found 16842752 at (non % 4 == 0) 1030
regex found 16842752 at 1030

index found 16908288 at (non % 4 == 0) 1034
regex found 16908288 at 1034

index found 16973824 at (non % 4 == 0) 1038
regex found 16973824 at 1038

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail


Replies are listed 'Best First'.
Re^7: Warning: Unicode bytes! by hv (Prior) on Apr 26, 2004 at 09:38 UTC
Good, finally some code. :) The problem lies in the first part of your regexp:

`m[ (?<= ^ .{4} )*? ... ]x`

The `(?<= ... )` construct is an assertion, so matching it "zero or more times" is the same as matching it zero or one times, and in all these cases it is matching zero times: that is, the text doesn't follow (beginning of string followed by 4 characters (bytes)). I'm not sure why you turned off warnings within the block, but the "matches null string many times" warning was a (not very helpful) indication of this.

In any case, you cannot use variable-width matches inside a lookbehind, so if you want to stick with this approach, I would suggest something like this:

`if ($bindata =~ m[ ^ (?: .{4} )*? \Q$bin\E ]x) { print "regex found $n at ", pos( $bindata ) - length( $bin ); }`

Hugo
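A hedged sketch of that suggestion as a standalone script. Two things here are additions rather than part of the suggestion as posted: the /s modifier, so that `.` also matches a "\n" byte (0x0a) in binary data, and reading the offset from `@-`, since pos() is only set by a successful /g match. The helper name aligned_match is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of the first occurrence of
# $needle in $haystack that starts on a multiple of $width, or -1 if none.
sub aligned_match {
    my( $haystack, $needle, $width ) = @_;
    # Anchor at the start, then skip whole $width-byte elements lazily,
    # so the needle can only ever be tried on an element boundary.
    return $haystack =~ m[ \A (?: .{$width} )*? ( \Q$needle\E ) ]xs
        ? $-[1]    # start offset of the captured needle
        : -1;
}

my $bindata = pack 'N*', 0 .. 4000;

# 16777216 (0x01000000) only occurs straddling two elements, at offset 7.
print aligned_match( $bindata, pack( 'N', 16_777_216 ), 4 ), "\n";    # -1
# 1000 is the element at index 1000, so it starts at byte 4000.
print aligned_match( $bindata, pack( 'N', 1000 ), 4 ), "\n";          # 4000
```

Because the pattern is anchored and advances in fixed-width steps, the misaligned hits that index reports in the output above never match here.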
Re: Re^7: Warning: Unicode bytes! by Anomynous Monk (Scribe) on Apr 26, 2004 at 15:44 UTC
Hugo, I'm curious to know if you can think of any reason to `use bytes` in 5.8.4 and onward? My understanding is that utf8 is treated like tainted data: if you don't introduce any, it won't rear its ugly head.
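One way to check that understanding is to look at the internal UTF-8 flag with utf8::is_utf8. This is only a sketch of the idea, with variable names chosen for illustration: a string built by pack stays a plain byte string until something wide is joined to it.

```perl
use strict;
use warnings;

# A string produced by pack is a plain byte string: no UTF-8 flag.
my $bin = pack 'N', 16_777_216;
print utf8::is_utf8( $bin ) ? "flagged" : "bytes", "\n";      # bytes

# Concatenating a character above 0xFF upgrades the whole result.
my $mixed = $bin . "\x{263a}";
print utf8::is_utf8( $mixed ) ? "flagged" : "bytes", "\n";    # flagged
print length( $mixed ), "\n";                                 # 5 characters
```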
Re^9: Warning: Unicode bytes! by tye (Sage) on Apr 26, 2004 at 17:46 UTC
I don't find it that hard to come up with cases where I'd want to look at the bytes used to represent some UTF-8 string. Probably these could be done by unsetting the UTF-8 bit on the string (or on a copy of it), but there being more than one way is Perlish. For example, I might just want to know the storage size of a UTF-8 string. Perhaps I have an algorithm that compresses using the concepts of bytes but I want it to "just work" when given a string, whether it is UTF-8 or not. Perhaps I want to transmit a UTF-8 string over a system that has problems with some specific bytes and I want to check for those bytes. Perhaps I want to uuencode a UTF-8 string. Perhaps I need to compute a byte-based checksum of a UTF-8 string. - tye
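As an illustration of the storage-size case, here is a small sketch comparing the character count with the byte count. The sample string is made up; bytes::length (loaded with `use bytes ()`, which does not enable the pragma) and Encode's encode_utf8 are two of the several ways to get at the bytes.

```perl
use strict;
use warnings;
use bytes ();                      # load bytes::length without enabling the pragma
use Encode qw( encode_utf8 );

# Six characters; the last one is above 0xFF, so the string is held as UTF-8.
my $str = "caf\x{e9} \x{263a}";

my $chars = length $str;                   # character count
my $bytes = length encode_utf8( $str );    # size once encoded as UTF-8
my $raw   = bytes::length( $str );         # size of the internal representation

print "$chars characters, $bytes UTF-8 bytes, $raw internal bytes\n";
# prints "6 characters, 9 UTF-8 bytes, 9 internal bytes"
```

The two byte counts agree here only because the string is already stored as UTF-8 internally; for a string with no character above 0xFF they can differ, which is one reason to prefer an explicit encode when the goal is the UTF-8 size.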
Re^9: Warning: Unicode bytes! by hv (Prior) on Apr 26, 2004 at 15:56 UTC
I certainly wouldn't rule it out: I can imagine there are times you'd want to peek at the internal encoding of a string. However the only reason I can think of off the top of my head is to investigate a problem that you think may be a bug in perl, and there are other hammers I'd usually grab first in such cases (such as Devel::Peek, `perl -Dxxx` and the hammer-of-hammers gdb). Hugo
Re: Re^7: Warning: Unicode bytes! by BrowserUk (Patriarch) on Apr 26, 2004 at 20:01 UTC
Thanks Hugo. That indeed fixes my problem, with the emphasis on "my".