Well, I was putting words in the reader's mouth, but I (and seemingly most other programmers) would like perl to track which scalars are officially intended as strings of Unicode characters and which are plain bytes. With that, I could make my modules "DWIM" and just magically do the right thing when handed a parameter.

Unfortunately, the way Unicode support was added to Perl doesn't allow for this distinction. Perl added Unicode support on the assumption that the author would keep track of which scalars were Unicode text and which were not.

It just so happens that when perl is storing official Unicode data, and the characters fall outside of the range 0-255, it uses a loose version of UTF-8 to store the values internally. People hear about this (because it was fairly publicly documented and probably shouldn't have been) and think "well, there's the indication of whether the scalar is intended to be characters or not!". But that's a bad assumption, because there are cases where perl stores Unicode characters in the 128-255 range as plain bytes, and cases where perl upgrades your string of binary data to internal UTF-8 when you never intended those bytes to be Unicode at all.
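Here's a minimal sketch of both cases, using only the core utf8::is_utf8 and utf8::upgrade functions. The variable names and sample data are made up for illustration, and the explicit utf8::upgrade call stands in for the implicit upgrades that can happen elsewhere, such as when a byte string is concatenated with a string containing characters above 255.

```perl
use strict;
use warnings;

# Intended as text: U+00E9 is in the 128-255 range, so perl is free to
# store it as a single plain byte with the flag off.
my $text = "caf\xE9";
my $text_flag = utf8::is_utf8($text) ? 1 : 0;

# Intended as binary data (illustrative bytes only).
my $bytes = "\xFF\xD8\xFF\xE0";
my $copy  = $bytes;

# Change only the internal storage format. The value is unchanged.
utf8::upgrade($copy);
my $copy_flag  = utf8::is_utf8($copy) ? 1 : 0;
my $same_value = ($copy eq $bytes) ? 1 : 0;
my $same_len   = (length($copy) == length($bytes)) ? 1 : 0;

printf "text flag:  %d\n", $text_flag;   # 0, though it is meant as text
printf "bytes flag: %d\n", $copy_flag;   # 1, though it is meant as bytes
printf "same value: %d, same length: %d\n", $same_value, $same_len;   # 1, 1
```

The flag ends up "wrong" for both scalars, yet neither value changed as far as Perl's string semantics are concerned.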

The internal utf8 flag *usually* matches whether the scalar was intended to be Unicode Text, but if you try to rely on that you'll end up with bugs in various edge cases, and then blame various core features or module authors for breaking your data, when it really isn't their fault. This is why any time the topic comes up, the response is a firm "you must keep track of your own encodings" and "pay no attention to the utf8 flag". Because any other stance on the matter results in chaos, confusion, and bugs.
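As a rough sketch of what "keep track of your own encodings" can look like in practice: decode bytes to characters at the point where you read them, work with characters in between, and encode back to bytes at the point where you write them. This uses the core Encode module; the variable names are only illustrative, and the sample input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Die on malformed data, and leave the source scalar untouched.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Bytes as they might arrive from a file or socket, known to be UTF-8.
my $octets_in = "caf\xC3\xA9";

# Input boundary: from here on, $text holds characters by convention.
my $text = decode('UTF-8', $octets_in, $check);

# Output boundary: turn the characters back into bytes.
my $octets_out = encode('UTF-8', $text, $check);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($octets_in), length($text), length($octets_out);   # 5, 4, 5
```

Nothing here ever inspects the utf8 flag. Which scalars hold characters and which hold bytes is decided by where they sit relative to the decode and encode calls.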


In reply to Re^3: How to set the UTF8 flag? by NERDVANA
in thread How to set the UTF8 flag? by dissident
