dissident has asked for the wisdom of the Perl Monks concerning the following question:

Somewhere in a script that has turned all UTF8 flags on, Perl seems to get confused.
No matter what I try, it destroys the encoding when writing the data to a file, by writing two characters for every Unicode byte.
Thus most Unicode characters become a 4-byte, 2-character near-random looking string.
Could the UTF8 flag be not set where it should be set?

As it is apparently impossible to reliably check whether a string variable holds UTF-8 or Latin-1 data, the best solution might be just to set the UTF8 flag when the data is known to be good UTF-8.
So, is there a function or some reliable technique to set the UTF8 flag?

Or could it maybe be a better idea in those cases to write the file as raw binary data instead, to circumvent unwanted conversions?

Any other idea what to do?

Replies are listed 'Best First'.
Re: How to set the UTF8 flag?
by ysth (Canon) on Aug 18, 2025 at 03:06 UTC
    utf8::is_utf8($string) will tell you whether a string is stored as utf8 characters or single byte characters (no require/use needed). And utf8::upgrade($string) will convert a string stored as single byte characters to being stored as utf8. But that's not usually what you want; you want a layer on the filehandle that will convert whichever form is being output to utf8 (or whatever other encoding you choose). You can set this with open or after the file is opened with binmode.
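
    For example (file names made up for illustration), either form gives you a handle that encodes character strings to UTF-8 on output:

        use strict;
        use warnings;
        use utf8;   # the string literal below contains characters, not bytes

        # Declare the layer when opening ...
        open my $out, '>:encoding(UTF-8)', 'report.txt'
            or die "Cannot open report.txt: $!";

        # ... or add it afterwards with binmode.
        open my $log, '>', 'log.txt'
            or die "Cannot open log.txt: $!";
        binmode $log, ':encoding(UTF-8)';

        # Print character strings; the layer does the encoding.
        print {$out} "Grüße, Perl Monks\n";
        print {$log} "Grüße, Perl Monks\n";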

    But some actual sample code/data would be very helpful; when you say "writing two characters for every Unicode byte" it makes me think you have some misconceptions that we could help clear up.
      Great thanks!
      When searching the web, I only found some (obviously outdated) information claiming that there is no reliable way to check the UTF-8 bit.
      So is_utf8() was exactly what I needed to zero in on the problem's cause.
      Turned out that HTTP::Tiny does not support Unicode, just raw text.
      Thus the issue was resolved by decoding its UTF-8 byte string data with decode().
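
      Roughly, the fix was along these lines (the URL is a placeholder, and I simply assume the body is UTF-8 here; a more careful version would check the Content-Type header):

          use strict;
          use warnings;
          use HTTP::Tiny;
          use Encode qw(decode);

          my $response = HTTP::Tiny->new->get('https://example.org/some-utf8-page');
          die "Request failed: $response->{status} $response->{reason}\n"
              unless $response->{success};

          # HTTP::Tiny hands back raw bytes; decode them into Perl characters.
          my $text = decode('UTF-8', $response->{content}, Encode::FB_CROAK);
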
        Turned out that HTTP::Tiny does not support Unicode, just raw text.
        That's correct. The statement in the documentation of HTTP::Tiny might deserve a more prominent place:
        Content data in the request/response is handled as "raw bytes". Any encoding/decoding (with associated headers) are the responsibility of the caller.

        Turned out that HTTP::Tiny does not support Unicode, just raw text.

        HTTP has no concept of encoding. It's just a file transfer protocol.

        By definition, text files don't have an encoding defined within, so HTTP headers can be used to communicate the encoding of the text file. But that doesn't mean that the HTTP agent should automatically decode the file. And that doesn't apply to binary files such as XHTML. Even modern HTML is really a binary file.

Re: How to set the UTF8 flag?
by NERDVANA (Priest) on Aug 18, 2025 at 15:38 UTC
    There are better explanations out there, but the short welcome-to-unicode-in-perl is:
    • You, the programmer, are responsible for knowing the encoding of all scalars within your program, usually by reading the documentation of every API you use and knowing the flags on your database connections and etc.
    • There is an internal utf-8 flag on every scalar, but it does not mean what you want it to mean. It is an implementation detail and you should never need to look at it unless you're writing C code.
    • You must never concatenate string-of-bytes with string-of-unicode. Convert them to matching encodings before concatenating.
    • If you have a string which you know to be encoded as UTF-8 bytes, and you need to pass it to a function that expects a unicode string, call decode('UTF-8', $octets, Encode::FB_CROAK) to get a string of characters.
    • If you have a string which you know contains unicode characters, and you need to pass it to a function that expects bytes, call encode('UTF-8', $characters, Encode::FB_CROAK) to get a string of bytes. (A short sketch of these calls follows this list.)
    • You can modify the string in-place with utf8::encode and utf8::decode, but beware that now you may have changed assumptions about what is in that string if it can be seen by other parts of the program. (such as encoding or decoding hash entries or global variables)
    • If you have a string and you don't know whether it is unicode or bytes, stop and refactor your program until you do know.
    • If you have a string and you really can't know whether it is unicode or bytes and you don't have time to refactor your program, call decode in a loop like while (utf8::decode($s)) {} which results in you definitely having a unicode string. There is a tiny chance that you break some unlikely sequence of characters that probably won't ever be seen in a real string of unicode. Now that you know you have a unicode string, you can call a single encode() to get the bytes you need.
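
    A minimal sketch of those calls, with a hard-coded example string standing in for real data:

        use strict;
        use warnings;
        use Encode qw(decode encode);

        # Octets known to hold UTF-8 (here: "müde" as bytes).
        my $octets = "m\xC3\xBCde";

        # Bytes -> characters, croaking on malformed input.
        my $characters = decode('UTF-8', $octets, Encode::FB_CROAK);

        # Characters -> bytes, e.g. just before handing data to an API
        # that expects octets.
        my $bytes_again = encode('UTF-8', $characters, Encode::FB_CROAK);

        # The last-resort loop from the final bullet: decode in place
        # until the string no longer parses as UTF-8.
        my $unknown = "m\xC3\xBCde";
        1 while utf8::decode($unknown);   # now definitely a character string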

      100% agree.

      From that, we can conclude that is_utf8 is at best a debugging tool for XS modules and Perl itself.

      That's not what the OP was doing, so that's not what they should have been using. They should have been inspecting the contents of the string. To inspect the contents of a string without creating new encoding issues, one can use sprintf "%vX".
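
      For instance (sample values made up), the difference between encoded bytes and decoded characters shows up immediately:

          use strict;
          use warnings;
          use Encode qw(decode);

          my $bytes = "\xC3\xA4";               # the UTF-8 octets of "ä"
          my $chars = decode('UTF-8', $bytes);  # the single character U+00E4

          printf "bytes: %vX\n", $bytes;   # bytes: C3.A4
          printf "chars: %vX\n", $chars;   # chars: E4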

      I'm glad to learn the OP's solution was to use decode and not anything related to the UTF-8 flag.

      There is an internal utf-8 flag on every scalar, but it does not mean what you want it to mean.

      What does it mean then? And how do we access this flag from Perl? Is it just a boolean value, or rather an integer? I can think of 6 different states that it could assume: 1) plain 8-bit ASCII, 2) UTF8, 3) UTF16BE, 4) UTF16LE, 5) UTF32BE, and 6) UTF32LE

        Many—including the OP, apparently—assume it indicates whether the characters[1] of the string are Code Points or bytes. It does not.

        It's a bit that indicates the internal storage format of the string.

        • When 0, the string is stored in the "downgraded" format.

          The characters are stored as an array of C char objects.

        • When 1, the string is stored in the "upgraded" format.

          The characters—whatever they may be—are encoded using utf8 (not UTF-8).

        Being internal, you have no reason to access it unless debugging an XS module (which must deal with the two formats) or Perl itself. In such cases, you can use the aforementioned utf8::is_utf8 or Devel::Peek's Dump. C code has access to the similar SvUTF8 and sv_dump.


        1. I define character as an element of a string as returned by substr( $_, $i, 1 ) or ord( substr( $_, $i, 1 ) ), whatever the value means.
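
        To illustrate (for debugging only, as noted above), two strings that compare equal can differ in storage format, and that difference is all the flag reports:

            use strict;
            use warnings;
            use Devel::Peek;   # exports Dump

            # The same one-character string ("ä", code point 0xE4) twice.
            my $downgraded = "\xE4";
            my $upgraded   = "\xE4";
            utf8::upgrade($upgraded);   # change only the internal storage

            print $downgraded eq $upgraded ? "equal\n" : "different\n";  # equal
            print utf8::is_utf8($downgraded) ? "1\n" : "0\n";            # 0
            print utf8::is_utf8($upgraded)   ? "1\n" : "0\n";            # 1

            Dump($downgraded);   # FLAGS without UTF8; the character occupies one byte
            Dump($upgraded);     # FLAGS include UTF8; the same character occupies two bytes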

        The part you quoted continues onwards with:

        It is an implementation detail and you should never need to look at it unless you're writing C code.

        So, you should not access this flag from Perl and also not concern yourself with the value.

        And how do we access this flag from Perl?

        You should not. perlunifaq is pretty clear when it says

        Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all.

        🦛

        Well, I was putting words in the reader's mouth, but I (and seemingly most other programmers) would like it if perl were tracking which scalars are officially intended as a string of Unicode characters, and which scalars are plain bytes. I would like to have this so that I can make my modules "DWIM" and just magically do the right thing when handed a parameter.

        Unfortunately, the way Unicode support was added to Perl doesn't allow for this distinction. Perl added unicode support on the assumption that the author would keep track of which scalars were Unicode Text and which were not.

        It just so happens that when perl is storing official Unicode data, and the characters fall outside of the range of 0-255, it uses a loose version of UTF-8 to store the values internally. People hear about this (because it was fairly publicly documented and probably shouldn't have been) and think "well, there's the indication of whether the scalar is intended to be characters or not!". But that's a bad assumption, because there are cases where Perl stores Unicode characters in the 128-255 range as plain bytes, and cases where perl upgrades your string of binary data to internal UTF-8 when you never intended those bytes to be Unicode at all.

        The internal utf8 flag *usually* matches whether the scalar was intended to be Unicode Text, but if you try to rely on that you'll end up with bugs in various edge cases, and then blame various core features or module authors for breaking your data, when it really isn't their fault. This is why any time the topic comes up, the response is a firm "you must keep track of your own encodings" and "pay no attention to the utf8 flag". Because any other stance on the matter results in chaos, confusion, and bugs.
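
        A small illustration of both edge cases (the values are chosen arbitrarily):

            use strict;
            use warnings;

            # Genuine text whose code points all fit below 256 may be
            # stored as plain bytes -- the flag stays off.
            my $text = chr(0xE9) . chr(0xE8);              # "éè"
            print utf8::is_utf8($text) ? "1\n" : "0\n";    # 0

            # Binary data dragged into the upgraded format by an ordinary
            # string operation -- the flag turns on for the result, although
            # the binary payload was never meant to be Unicode.
            my $binary = join '', map chr, 0x80 .. 0xFF;
            my $mixed  = $binary . "\x{263A}";             # concatenation with a wide character
            print utf8::is_utf8($mixed) ? "1\n" : "0\n";   # 1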

Re: How to set the UTF8 flag?
by jeffenstein (Hermit) on Aug 18, 2025 at 07:43 UTC

    It sounds like maybe it's being double-encoded.

    The only reliable way I know of is to keep track of what is in utf8 and what is in bytes, and be sure to encode and decode while passing things into / out of the code you wrote. There is the utf8::all module that will usually do what you mean, if you understand where it can go wrong. I think perlunicook is the best practical guide, but it can be hard to find as the other docs (perluniintro, etc.) don't tell you about this one.
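
    If I remember its interface correctly, the whole of it is a single line; the pragma turns on UTF-8 for the source code, the standard handles, @ARGV and files you open afterwards, which is convenient but also hides the encode/decode boundaries mentioned above:

        # Sketch only -- assumes the CPAN module utf8::all is installed.
        use utf8::all;

        print "Grüße\n";   # STDOUT now has a UTF-8 layer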

    🐪
Re: How to set the UTF8 flag?
by dissident (Beadle) on Aug 20, 2025 at 12:39 UTC
    Thank you all again for the wealth of information. :-)

    I used Data::Dumper to examine the string contents.
    You can see that the string is in Perl's internal character format if the umlauts, accents etc. are stored as single code points (left column in this table: https://www.utf8-chartable.de/ ) and not as 2-byte UTF-8 sequences with C2/C3 as the first byte.
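
    With $Data::Dumper::Useqq set, the difference is easy to see (using the Euro sign so the decoded form stands out; output shown roughly, from memory):

        use strict;
        use warnings;
        use Data::Dumper;
        use Encode qw(decode);

        $Data::Dumper::Useqq = 1;   # print escapes instead of raw bytes

        my $bytes = "\xE2\x82\xAC";              # the UTF-8 octets of the Euro sign
        my $chars = decode('UTF-8', $bytes);     # the single character U+20AC

        print Dumper($bytes);   # roughly: $VAR1 = "\342\202\254";
        print Dumper($chars);   # roughly: $VAR1 = "\x{20ac}";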

    is_utf8 seems to me to be just an indicator that Perl has either created a variable natively in UTF-8, or has done a conversion (decode).

    It does not mean that the data is actually in Perl's native format! One of the servers I use returns data which has to be decode()-d twice to become valid UTF-8 data. So after the first decode() is_utf8 returns true, even though the data still needs another decode() run to become ungarbled. I mention this just to emphasize that one should not blindly rely on what is_utf8 says.
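
    For the record, here is a hard-coded stand-in for such a doubly encoded response ("ä" encoded to C3 A4, mis-read as Latin-1 and encoded to UTF-8 a second time):

        use strict;
        use warnings;
        use Encode qw(decode);

        my $from_server = "\xC3\x83\xC2\xA4";       # what arrives for a single "ä"

        my $once  = decode('UTF-8', $from_server);  # "Ã¤" -- is_utf8 is now true, yet still garbled
        my $twice = decode('UTF-8', $once);         # "ä"  -- only the second pass yields the real text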