Warning: Unicode bytes!

This is just a gentle heads-up that may save someone else from spending a long time tracking down sporadic mismatches when comparing or searching strings containing arbitrary binary data.

If, like me, you read this from the use bytes pod:

The use bytes pragma disables character semantics for the rest of the lexical scope in which it appears. no bytes can be used to reverse the effect of use bytes within the current lexical scope.
Perl normally assumes character semantics in the presence of character data (i.e. data that has come from a source that has been marked as being of a particular character encoding). When use bytes is in effect, the encoding is temporarily ignored, and each string is treated as a series of bytes.

to mean that any string comparisons or searches taking place under the auspices of use bytes would be exempt from Unicode considerations, they aren't if the regex engine is involved!

Whether this is by design (why?) or oversight (amazing!), it is possible to search a string and get matches at apparently random places that simply defy explanation, until you start looking at the data in terms of characters and not bytes. Very confusing, especially when you've taken the precaution of placing the code in a use bytes block.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: Warning: Unicode bytes! by Anomynous Monk (Scribe) on Apr 26, 2004 at 00:38 UTC
`use bytes` is almost never a good idea; in essence, it tells perl to consider any strings that perl knows are utf8 encoded as if each byte of the encoded form were a separate character. It's a relic of 5.6's failed approach to unicode, IMO. Leave it off, and so long as you have no exposure to any data perl thinks is utf8 encoded, you will have no compatibility problems. The only semi-invisible place utf8 may creep in with newer perls (5.8.1+) is if you have a source file that is UTF-16 encoded, with proper byte order marks at the beginning; in such a file, perl will have literal strings that contain high-bit characters encoded as utf8.
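A minimal sketch of what "each byte of the encoded form becomes a separate character" looks like in practice. The variable names are only illustrative, and it assumes a perl of 5.8.1 or later (where the `utf8::` functions are available without loading anything):

```perl
use strict;
use warnings;

# One character above 255, so perl has to store it internally as UTF-8.
my $str = chr(400);

# Character semantics: the string is one character long.
my $char_len = length $str;

my $byte_len;
{
    use bytes;                  # lexically scoped to this block
    # Byte semantics: length now counts the bytes of the internal encoding.
    $byte_len = length $str;
}

printf "characters: %d, bytes: %d, utf8 flag: %s\n",
    $char_len, $byte_len, utf8::is_utf8($str) ? "on" : "off";
# prints "characters: 1, bytes: 2, utf8 flag: on"
```

The string never changes; only the view of it does. That is why results inside and outside the block can disagree about the same scalar.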
Re: Re: Warning: Unicode bytes! by BrowserUk (Patriarch) on Apr 26, 2004 at 00:49 UTC
With respect, you are wrong! I know what data my scalars contain, and none of it is, nor could ever be, mistakable for unicode data. Any determination by perl, that IT knows better than I, is a guess--and a wrong guess! For perl to guess, against my explicit instruction to the contrary, is also wrong.
Re: Re: Re: Warning: Unicode bytes! by dragonchild (Archbishop) on Apr 26, 2004 at 01:18 UTC
To a human, it couldn't be mistaken. However, bits 1101 1100 1111 0111 (or whatever) may mean AX or it could mean the moon character in Chinese. That's all the contribution I have, regardless of PDF::Template's claim of Unicode compatibility. :-)

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose
Re: Re: Re: Re: Warning: Unicode bytes! by BrowserUk (Patriarch) on Apr 26, 2004 at 01:37 UTC
Re: Re: Re: Re: Re: Warning: Unicode bytes! by theorbtwo (Prior) on Apr 26, 2004 at 10:09 UTC
Re: Re: Re: Re: Re: Warning: Unicode bytes! by Anomynous Monk (Scribe) on Apr 26, 2004 at 03:52 UTC
Some notes below your chosen depth have not been shown here
Re: Re: Re: Warning: Unicode bytes! by Anomynous Monk (Scribe) on Apr 27, 2004 at 10:50 UTC
"With respect, you are wrong! I know what data my scalars contain, and none of it is, nor could ever be, mistakable for unicode data."

Then you have no reason to say `use bytes`.

"Any determination by perl, that IT knows better than I, is a guess--and a wrong guess! For perl to guess, against my explicit instruction to the contrary, is also wrong."

Perl shouldn't guess; it should only flag as utf8 what you have (somehow or other) told it is utf8. `use bytes` doesn't do what you think; if anything, it will (in the presence of utf8 data) make things worse by exposing you not just to unicode characters but to the bytes that make up their encoding.
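A rough illustration of the "make things worse" point, as a sketch only. It assumes a hypothetical four-byte binary payload, and uses `utf8::upgrade` to stand in for whatever might cause perl to store a copy as utf8 internally:

```perl
use strict;
use warnings;

# Hypothetical binary payload: four arbitrary bytes.
my $bin  = "\xE9\xFF\x00\x80";

# Same four characters, but force the internal storage to UTF-8.
my $copy = $bin;
utf8::upgrade($copy);

# Without use bytes, the two strings are still equal character for character.
my $equal_normally = ($bin eq $copy) ? 1 : 0;

my ($len_plain, $len_upgraded);
{
    use bytes;
    $len_plain    = length $bin;    # 4: one byte per character
    $len_upgraded = length $copy;   # 7: E9, FF and 80 each take two bytes as UTF-8
}

# Put the copy back to one byte per character; this dies if the string
# holds a character above 255 and so cannot be represented that way.
utf8::downgrade($copy);
my $flag_after = utf8::is_utf8($copy) ? 1 : 0;

print "equal: $equal_normally, plain: $len_plain, ",
      "upgraded: $len_upgraded, flag after downgrade: $flag_after\n";
# prints "equal: 1, plain: 4, upgraded: 7, flag after downgrade: 0"
```

If the data really is plain bytes, checking `utf8::is_utf8` (or calling `utf8::downgrade`) at the point where it enters the program is one way to confirm that, rather than relying on `use bytes` to paper over it.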