in reply to Re^4: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

I agree. The Perl implementation is documented to use UTF-8 encoding for one of its two internal string forms and 8-bit characters for the other. The documentation also explains when each form occurs and how the two are handled during concatenation, with various options.

Certainly it is less problematic and more maintainable not to count on any subtle details that might shift the meaning.

Hmm, just what is the 8-bit form? If it's "whatever was read in", it might include characters encoded in multiple bytes, using some other code page. So I would be inclined to feel safe treating the internal length in bytes as the UTF-8 length only if I read the string in from a file using UTF-8 encoding, or if it was a string literal in a program whose source file used the utf8 pragma. I think there is also a utility function somewhere to tell you which form a string is in.

In fact, wouldn't the UTF-8 encoder just check that flag first and, if the string is already in the UTF-8 form, realize the conversion is a no-op? So using it would be efficient, if you don't mind copying the string.
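
For what it's worth, here is a minimal sketch of the "just encode it and take the length" approach, assuming the core Encode module. The helper name utf8_byte_length is made up for illustration. utf8::is_utf8, utf8::upgrade and utf8::downgrade are the built-in functions for inspecting and switching the internal form; they are used here only to show that the answer does not depend on which form the string happens to be in.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the thread): number of bytes the string
# occupies once encoded as UTF-8, regardless of its internal form.
sub utf8_byte_length {
    my ($str) = @_;                        # work on a copy of the argument
    return length(encode('UTF-8', $str));  # encode() returns a byte string
}

# The same four characters, forced into each internal form.
my $eight_bit = "caf\x{e9}";
utf8::downgrade($eight_bit);               # 8-bit form, flag off
my $utf8_form = "caf\x{e9}";
utf8::upgrade($utf8_form);                 # UTF-8 form, flag on

for my $s ($eight_bit, $utf8_form) {
    printf "flag=%d chars=%d utf8_bytes=%d\n",
        (utf8::is_utf8($s) ? 1 : 0),       # which internal form is in use
        length($s),                        # 4 characters either way
        utf8_byte_length($s);              # 5 bytes: c, a, f are 1 each, e-acute is 2
}
```

Going through the encoder means the code never has to care about the flag, which fits the point above about not counting on subtle internal details.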
