in reply to My UTF-8 text isn't surviving I/O as expected

I decided to join the 21st century

Technically, you decided to finally join the late 20th century. The basics of Unicode were started in 1989, and UTF-8 was first presented to the public at the USENIX conference in 1993. (Sorry, couldn't resist.)

So, welcome to the international club of pain and suffering... uh, I meant to write "supporters of umlauts, Linear A(¹), hidden control characters that will confuse your text renderer(²), and black Santas(³). And of multiple ways of encoding the same character, with the same visible text length but different byte lengths, that are still the same character but need special (and complicated) functions to string-compare them(4). And of apparently broken superscripts on PerlMonks(5)."


(¹) Linear A

(²) Unicode control characters

(³) Emoji modifiers and examples in color

(4) Unicode equivalence, also incorrect length of strings with diphthongs

(5) PerlMonks seems to display only the superscripts ¹²³ correctly (I only tried the post preview), but it should really support all the digits and signs. Unicode Block “Superscripts and Subscripts”
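
To make footnote (4) a bit more concrete, here is a minimal sketch of the "same character, different encoding" problem. It assumes nothing beyond the core modules Encode and Unicode::Normalize; the helper names count_graphemes and same_text are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC);

# Two ways to write the same visible character "é".
my $composed   = "\x{E9}";     # one code point: U+00E9
my $decomposed = "e\x{301}";   # two code points: "e" + combining acute accent

# Hypothetical helper: count user-visible characters (grapheme clusters).
sub count_graphemes {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;
    return $count;
}

# Hypothetical helper: compare after normalizing both sides to NFC.
sub same_text {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

printf "graphemes:   %d vs %d\n",
    count_graphemes($composed), count_graphemes($decomposed);   # 1 vs 1
printf "code points: %d vs %d\n",
    length($composed), length($decomposed);                     # 1 vs 2
printf "UTF-8 bytes: %d vs %d\n",
    length(encode('UTF-8', $composed)),
    length(encode('UTF-8', $decomposed));                       # 2 vs 3
printf "plain eq:    %s\n", ($composed eq $decomposed ? "equal" : "different");
printf "normalized:  %s\n", (same_text($composed, $decomposed) ? "equal" : "different");
```

Both strings render as a single character, but a plain eq says they differ, and only the normalized comparison treats them as equal.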

PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Also check out my sister's artwork and my weekly webcomics

Re^2: My UTF-8 text isn't surviving I/O as expected
by choroba (Cardinal) on Nov 25, 2024 at 15:24 UTC
    For me, ⁵ works without problems. Isn't it a browser/font problem?

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^2: My UTF-8 text isn't surviving I/O as expected
by ibm1620 (Hermit) on Nov 26, 2024 at 01:56 UTC
    Reading Tom Christiansen's sobering post about Unicode was enough to discourage me from trying to become proficient with Unicode. I'm retired so I get to do that :-)

      On the surface, yes, it looks bad. But in my experience, you can cover nearly all cases (99.5% or so) by following some simple rules, no matter the encoding:

      • Convert all incoming data to Perl's internal representation (utf8::decode, Encode::decode or similar)
      • Convert all outgoing data to the correct encoding (UTF-8 or similar) before it leaves your program
      • Unless you really have to verify very specific things in text, just treat it like a random binary blob.
      • 0 + $var works for converting text to numeric values.
      • If you do any type of string comparison in your code, always normalize both sides using Unicode::Normalize and always stick to the same normalization form.
      • Don't assume that any other text encoding standard is saner, or even a global standard.
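
A minimal sketch of how the rules above could look in code, assuming the data is UTF-8 and that NFC is the normalization form you picked. The helper names text_in, text_out and text_eq are invented for this example; only decode, encode and NFC come from the core modules Encode and Unicode::Normalize.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Incoming: turn raw bytes into Perl's internal character strings.
sub text_in {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# Outgoing: turn character strings back into bytes in the target encoding.
sub text_out {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

# Comparison: normalize both sides to the same form (NFC assumed here).
sub text_eq {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

# The same word arriving from two sources, as raw UTF-8 bytes.
my $from_file = "caf\xC3\xA9";     # "café" with a precomposed é
my $from_form = "cafe\xCC\x81";    # "cafe" followed by a combining acute accent

my $first  = text_in($from_file);
my $second = text_in($from_form);

print "raw bytes equal:  ", ($from_file eq $from_form ? "yes" : "no"), "\n";
print "normalized equal: ", (text_eq($first, $second) ? "yes" : "no"), "\n";
print "round trip ok:    ", (text_out($first) eq $from_file ? "yes" : "no"), "\n";

# Text that is really a number: 0 + $var converts it.
my $count = 0 + text_in("42");
print "numeric value:    ", $count + 1, "\n";
```

For file handles, an I/O layer such as open my $fh, '<:encoding(UTF-8)', $path can do the decoding step for you, so the explicit decode call is mostly needed for data from sockets, databases or environment variables.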

      The basic ugliness of Unicode (or other text encodings) stems not from their engineers but from the basic fact that human language is a complicated mess. And written language is still a somewhat new concept in human evolution and we are still trying to figure out the finer details. At least with Unicode, you don't have to constantly switch schemes depending on who is using your software.
