in reply to Re^3: Add 1 to an arbitrary-length binary string
in thread Add 1 to an arbitrary-length binary string

”… characters that are not octets…”

That confuses me a bit. What else would they be made of?

use feature 'say';
use Encode qw(decode_utf8);

say unpack( "U*", "a");
printf("%04X\n", unpack('W*', decode_utf8("a")));
say join " ", unpack( "U*", "😎");
printf("%04X\n", unpack('W*', decode_utf8("😎")));
__END__
97
0061
240 159 152 142
1F60E

I read it like this: ”a” (code point U+0061) is one octet long, and 😎 (code point U+1F60E) is four octets long.

«The Crux of the Biscuit is the Apostrophe»

Re^5: Add 1 to an arbitrary-length binary string
by hv (Prior) on Nov 16, 2023 at 19:26 UTC

    Exactly as I said in the immediately following part of that sentence: Unicode characters with codepoints greater than 255.

    The string "\x{61}\x{1f60e}" has a length of two: it consists of two characters. Its internal representation happens to consist of 5 octets, but any time you have to care about the internal representation is an example of the Unicode bug.
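
    To make the distinction concrete, here is a minimal sketch that counts characters with length and counts octets only after an explicit encode_utf8. The variable names are illustrative, and the octet count is that of the UTF-8 encoding, which is not necessarily how any given perl stores the string internally.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Two characters: U+0061 and U+1F60E.
my $str = "\x{61}\x{1f60e}";

# length() counts characters, not octets.
my $chars = length($str);

# Octets only exist once we explicitly pick an encoding.
my $octets = length(encode_utf8($str));

printf "%d characters, %d octets when encoded as UTF-8\n", $chars, $octets;
# prints: 2 characters, 5 octets when encoded as UTF-8
```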

    The string "\x{61}\x{ff}" may be stored internally as either 2 or 3 octets; however it also has a length of two, consists of two characters, and will be incremented by my example code to the string "\x{62}\x{0}", regardless of the internal representation.
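
    A rough sketch of that last point follows. The helper increment_binary_string is hypothetical and is not the example code referred to above; it assumes every character has a code point of 255 or less and treats the string as a big-endian number with one base-256 digit per character. utf8::upgrade is used only to force the other internal representation.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the code from the earlier post):
# add 1 to a string seen as a big-endian number, one base-256 digit per character.
sub increment_binary_string {
    my ($s) = @_;
    my @digits = map { ord($_) } split //, $s;
    my $i = $#digits;
    while ($i >= 0) {
        if ($digits[$i] == 255) {   # digit overflows: reset it and carry left
            $digits[$i] = 0;
            $i--;
            next;
        }
        $digits[$i]++;
        last;
    }
    unshift @digits, 1 if $i < 0;   # carry ran off the front: grow by one digit
    return join '', map { chr($_) } @digits;
}

my $downgraded = "\x{61}\x{ff}";
my $upgraded   = "\x{61}\x{ff}";
utf8::upgrade($upgraded);           # same characters, different internal storage

my $r1 = increment_binary_string($downgraded);
my $r2 = increment_binary_string($upgraded);

printf "lengths: %d %d\n", length($downgraded), length($upgraded);
printf "results equal: %s\n", ($r1 eq $r2 ? 'yes' : 'no');
print join(' ', map { sprintf '%02X', ord($_) } split //, $r1), "\n";
# prints:
# lengths: 2 2
# results equal: yes
# 62 00
```

    Because the helper only ever looks at characters through ord and chr, it has no way to observe which representation it was handed.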