This is just a gentle heads-up that may save someone else from spending a long time tracking down sporadic mismatches when comparing or searching strings that contain arbitrary binary data.

If, like me, you read this from the use bytes pod:

The use bytes pragma disables character semantics for the rest of the lexical scope in which it appears. no bytes can be used to reverse the effect of use bytes within the current lexical scope.

Perl normally assumes character semantics in the presence of character data (i.e. data that has come from a source that has been marked as being of a particular character encoding). When use bytes is in effect, the encoding is temporarily ignored, and each string is treated as a series of bytes.

to mean that any string comparisons or searches taking place under the auspices of use bytes would be exempt from Unicode considerations, be warned: they aren't if the regex engine is involved!

Whether this is by design (why?) or oversight (amazing!), it is possible to search a string and get matches at apparently random places that simply defy explanation, until you start looking at the data in terms of characters and not bytes. That is very confusing, especially when you've taken the precaution of placing the code in a use bytes block.
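For anyone who wants to check their own perl, here is a minimal sketch of the kind of test that exposes the difference. The three-byte payload and the variable names are made up for illustration, and the upgrade is forced with utf8::upgrade only to simulate binary data that has picked up the UTF8 flag somewhere along the way. The sketch prints what the regex engine sees under use bytes rather than assuming it, since that may differ between perl versions, and then shows one possible defensive step (utf8::downgrade on a copy before searching).

```perl
use strict;
use warnings;

# Hypothetical binary payload: three bytes, two of them above 0x7F.
my $bin = "\xE9\x00\xFF";

# Simulate the string having been upgraded somewhere along the way
# (for example by concatenation with character data). The logical
# content is unchanged, but the internal storage is now UTF-8.
my $upgraded = $bin;
utf8::upgrade($upgraded);

my $flagged  = utf8::is_utf8($upgraded) ? 1 : 0;
my $char_len = length $upgraded;                     # 3 characters
my $byte_len = do { use bytes; length $upgraded };   # 5 internal bytes

# How many units the regex engine walks over under "use bytes" is the
# point in question, so report it instead of assuming an answer.
my $regex_units = do {
    use bytes;
    my $n = () = $upgraded =~ /./sg;
    $n;
};

# Defensive step (an assumption about what suits your data): work on a
# downgraded copy. With a true second argument, utf8::downgrade returns
# false instead of dying if the string holds characters above 0xFF.
my $safe     = $upgraded;
my $ok       = utf8::downgrade($safe, 1) ? 1 : 0;
my $safe_len = do { use bytes; length $safe };       # 3 bytes again

printf "flagged=%d chars=%d bytes=%d regex_units=%d downgraded=%d safe_bytes=%d\n",
    $flagged, $char_len, $byte_len, $regex_units, $ok, $safe_len;
```

If chars and bytes disagree for a string you believed was plain binary, the data has been upgraded, and any offsets or matches should be read with that in mind.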


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

In reply to Warning: Unicode bytes! by BrowserUk
