Many—including the OP, apparently—assume it indicates whether the characters[1] of the string are Code Points or bytes. It does not.
It's a bit that indicates the internal storage format of the string.
- When 0, the string is stored in the "downgraded" format. The characters are stored as an array of C char objects.
- When 1, the string is stored in the "upgraded" format. The characters—whatever they may be—are encoded using utf8 (not UTF-8).
Being internal, you have no reason to access it unless you are debugging an XS module (which must deal with the two formats) or Perl itself. In such cases, you can use the aforementioned utf8::is_utf8 or Devel::Peek's Dump. C code has access to the equivalent SvUTF8 and sv_dump.
[1] I define character as an element of a string as returned by substr( $_, $i, 1 ) or ord( substr( $_, $i, 1 ) ), whatever the value means.
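For the curious, here is a minimal sketch of the two introspection tools named above (the exact FLAGS output of Dump varies by perl version; the variable names are my own):

use Devel::Peek qw(Dump);

my $bytes = "caf\xE9";              # all code points <= 255: stays downgraded
my $chars = "caf\xE9\x{263A}";      # contains a code point > 255: must be upgraded

print utf8::is_utf8($bytes) ? "upgraded\n" : "downgraded\n";   # downgraded
print utf8::is_utf8($chars) ? "upgraded\n" : "downgraded\n";   # upgraded

Dump($bytes);   # FLAGS line has no UTF8; PV holds the raw bytes
Dump($chars);   # FLAGS line includes UTF8; PV holds the utf8-encoded form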
utf8 is a Perl-specific extension of UTF-8 capable of encoding any 72-bit value (but it's limited to encoding values the size of UVs in practice).
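A quick sketch of that point, assuming a 64-bit perl (the warning category name is my assumption for modern perls; older versions may differ):

use strict;
use warnings;
no warnings 'non_unicode';   # silence the "above Unicode maximum" warning, if any

# Ordinals far beyond the Unicode maximum (0x10FFFF) still round-trip,
# because the internal utf8 format is an extension of UTF-8, not strict UTF-8.
my $huge = chr(0x1000_0000);                  # well above U+10FFFF
printf "ord = %#x, flag = %d\n", ord($huge), utf8::is_utf8($huge) ? 1 : 0;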
I didn't downvote, but it's probably because you completely fabricated a definition of the flag.
The part you quoted continues with:
It is an implementation detail and you should never need to look at it unless you're writing C code.
So you should not access this flag from Perl, nor concern yourself with its value.
Well, I was putting words in the reader's mouth, but I (and seemingly most other programmers) would like it if perl were tracking which scalars are officially intended as a string of Unicode characters, and which scalars are plain bytes. I would like to have this so that I can make my modules "DWIM" and just magically do the right thing when handed a parameter.
Unfortunately, the way Unicode support was added to Perl doesn't allow for this distinction. Perl added Unicode support on the assumption that the author would keep track of which scalars were Unicode Text and which were not.
It just so happens that when perl is storing official Unicode data and the characters fall outside the range 0-255, it uses a loose version of UTF-8 to store the values internally. People hear about this (because it was fairly publicly documented and probably shouldn't have been) and think "well, there's the indication of whether the scalar is intended to be characters or not!" But that's a bad assumption, because there are cases where Perl stores Unicode characters in the 128-255 range as plain bytes, and cases where perl upgrades your string of binary data to internal UTF-8 when you never intended those bytes to be Unicode at all.
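Both failure modes are easy to reproduce; here is a small self-contained sketch (variable names are my own, and the exact flag states are what a typical perl produces, not a guarantee):

use strict;
use warnings;

# Unicode text whose code points all fit in 0-255 can sit in the downgraded
# (plain byte) format: the flag is off even though this IS character data.
my $text = "na\x{EF}ve";                      # "naive" with i-diaeresis
printf "text:     is_utf8 = %d\n", utf8::is_utf8($text) ? 1 : 0;   # usually 0

# Conversely, binary data can get silently upgraded, e.g. by concatenation
# with an upgraded string; the flag turns on although these bytes were
# never meant to be text at all.
my $blob = join '', map chr, 0x00 .. 0xFF;    # arbitrary binary bytes
my $copy = $blob . "\x{263A}";                # concatenation upgrades the copy
chop $copy;                                   # drop the smiley again
printf "blob:     is_utf8 = %d\n", utf8::is_utf8($blob) ? 1 : 0;   # 0
printf "upgraded: is_utf8 = %d\n", utf8::is_utf8($copy) ? 1 : 0;   # 1
printf "same characters? %s\n", ($blob eq $copy) ? "yes" : "no";   # yes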
The internal utf8 flag *usually* matches whether the scalar was intended to be Unicode Text, but if you try to rely on that you'll end up with bugs in various edge cases, and then blame various core features or module authors for breaking your data when it really isn't their fault. This is why, any time the topic comes up, the response is a firm "you must keep track of your own encodings" and "pay no attention to the utf8 flag", because any other stance on the matter results in chaos, confusion, and bugs.