Re^2: Another Unicode/emoji question

Re^3: Another Unicode/emoji question by ikegami (Patriarch) on Dec 23, 2023 at 05:46 UTC
You forgot to encode. Simple way to fix: `use open ":std", ":encoding(UTF-8)";`
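For anyone wondering what that layer changes in practice, here is a minimal sketch. It attaches the same `:encoding(UTF-8)` layer to an in-memory filehandle rather than to `STDOUT`, so the bytes that come out can be counted. The helper name `write_through_layer` is invented for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): print $text to an in-memory
# filehandle that carries the :encoding(UTF-8) layer, then return the raw
# bytes that ended up in the buffer.
sub write_through_layer {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $text  = "\x{1F436}";                  # one character (DOG FACE)
my $bytes = write_through_layer($text);   # what the handle actually received

printf "characters in: %d, bytes out: %d\n", length($text), length($bytes);
```

The `open` pragma with `:std` applies this kind of layer to the standard handles for you, which is why no explicit `binmode` call is needed once it is in place.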
Re^4: Another Unicode/emoji question by Bod (Parson) on Dec 23, 2023 at 11:47 UTC
"You forgot to encode"

I didn't forget... I didn't know I needed to...

"Simple way to fix"

The `open` pragma is new to me... thanks for bringing it to my attention. I've been reading the documentation and an ikegami answer on SO, and I am unsure what `use open ":std", ":encoding(UTF-8)";` is doing that isn't already there from `use utf8;` and `Content-type: text/calendar; charset=utf-8;`.

`":std"` applies the layer to the global filehandles `STDIN`, `STDOUT` and `STDERR` - is that right?

But what is the difference between `":encoding(UTF-8)"` and `":utf8"`? I gather `utf8` is a Perl-specific implementation of UTF-8, but how does this difference manifest in practical terms?

UPDATE: Somehow I missed `use open` in kcott's excellent answer here: Re: Another Unicode/emoji question
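One way to see a practical difference is sketched below. It compares the two Encode encodings that sit behind `:encoding(UTF-8)` and `:encoding(utf8)`; it does not exercise the bare `:utf8` layer. The helper `can_encode` is a made-up name, and the choice of code point 0x110000 (one past the Unicode maximum) is my own assumption about a convenient test value, based on my reading of the Encode documentation.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper: returns 1 if $text can be encoded with the named
# encoding without Encode raising an error, 0 otherwise.
sub can_encode {
    my ($encoding_name, $text) = @_;
    my $copy = $text;    # work on a copy so the caller's string is untouched
    my $ok = eval {
        encode($encoding_name, $copy, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# The two names resolve to different encoding objects.
print find_encoding('UTF-8')->name, "\n";    # utf-8-strict
print find_encoding('utf8')->name,  "\n";    # utf8

# 0x110000 is one past the last Unicode code point (0x10FFFF).
my $beyond_unicode = chr(0x110000);
printf "%-6s accepts it: %s\n", $_, can_encode($_, $beyond_unicode) ? 'yes' : 'no'
    for qw(UTF-8 utf8);
```

For ordinary text such as the dog emoji, both encodings produce the same bytes; the difference only shows up with input that is not valid for interchange.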
Re^5: Another Unicode/emoji question by kcott (Archbishop) on Dec 24, 2023 at 11:11 UTC
"... and I am unsure what `use open ":std", ":encoding(UTF-8)";` is doing that isn't already there from `use utf8;`* and ..."* [my emphasis] The utf8 pragma tells Perl that your source code contains Unicode characters. It has nothing to do with input or output. See the emboldened text near the start of the DESCRIPTION in that documentation: "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8." In my code example which you referenced, there is the statement "`say q{🐶 = }, "🐶";`". That statement contains Unicode characters and therefore I need "`use utf8;`". The rest of that code only contains 7-bit ASCII characters. If I were to remove "`say q{🐶 = }, "🐶";`", I wouldn't need "`use utf8;`". $ perl -E ' use strict; use warnings; use open OUT => qw{:encoding(UTF-8) :std}; say q{\x{1f436} = }, "\x{1f436}"; say q{\x{1F436} = }, "\x{1F436}"; say q{\N{DOG FACE} = }, "\N{DOG FACE}"; ' \x{1f436} = 🐶 \x{1F436} = 🐶 \N{DOG FACE} = 🐶 I don't know if it's difficult to get your head around (and I certainly don't mean to be patronising or condescending) but "`\N{DOG FACE}`" is part of the source code which contains twelve 7-bit ASCII characters: you do not need the `utf8` pragma for this. The "`\N{DOG FACE}`" resolves to a Unicode character in the output and the open pragma handles that. Here's what happens if I omit the `open` pragma: $ perl -E ' use strict; use warnings; say q{\x{1f436} = }, "\x{1f436}"; say q{\x{1F436} = }, "\x{1F436}"; say q{\N{DOG FACE} = }, "\N{DOG FACE}"; ' Wide character in say at -e line 4. \x{1f436} = 🐶 Wide character in say at -e line 5. \x{1F436} = 🐶 Wide character in say at -e line 6. \N{DOG FACE} = 🐶 And note that all of those "Wide character" warnings remain if I use the `utf8` pragma: $ perl -E ' use strict; use warnings; use utf8; say q{\x{1f436} = }, "\x{1f436}"; say q{\x{1F436} = }, "\x{1F436}"; say q{\N{DOG FACE} = }, "\N{DOG FACE}"; say q{🐶 = }, "🐶"; ' Wide character in say at -e line 5. 
    \x{1f436} = 🐶
    Wide character in say at -e line 6.
    \x{1F436} = 🐶
    Wide character in say at -e line 7.
    \N{DOG FACE} = 🐶
    Wide character in say at -e line 8.
    Wide character in say at -e line 8.
    🐶 = 🐶

As further examples, see "uparse - Parse Unicode strings", and its improved version "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]", both of which read and write Unicode characters and use the `open` pragma, but neither contains Unicode characters in the source code so neither uses the `utf8` pragma.

[Aside: I'm going away for Christmas and won't be touching any computer equipment. Christmas Day is less than two hours away in my time zone. If you have further questions regarding this, I won't get to them for a few days; although, someone else might.]

- Ken
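The point that `use utf8;` only affects how the source is read can also be seen without any output at all. The following is a minimal sketch, assuming the file is saved as UTF-8 and a Perl of 5.16 or later (so `\N{...}` works without loading `charnames` explicitly); the three helper names are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers (names invented for this sketch). Each returns the
# number of characters Perl sees in a dog-face string.

sub length_of_literal_without_utf8 {
    no utf8;                         # the default: source is read as bytes
    return length("🐶");
}

sub length_of_literal_with_utf8 {
    use utf8;                        # lexically scoped: source decoded as UTF-8
    return length("🐶");
}

sub length_of_named_escape {
    return length("\N{DOG FACE}");   # pure ASCII source, no utf8 pragma needed
}

printf "literal, no utf8:  %d\n", length_of_literal_without_utf8();   # 4
printf "literal, use utf8: %d\n", length_of_literal_with_utf8();      # 1
printf "named escape:      %d\n", length_of_named_escape();           # 1
```

Without the pragma the literal is four separate byte-valued characters; with it, or with an ASCII escape, the string holds the single code point U+1F436. None of this says anything about how the string is written out, which remains the job of the `open` pragma or an equivalent layer.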
Re^5: Another Unicode/emoji question by ikegami (Patriarch) on Dec 23, 2023 at 14:42 UTC
File handles expect bytes by default. You're providing Unicode Code Points. Converting text to bytes is called encoding. You told the other end you were providing UTF-8, but you never applied this encoding.

"But what is the difference between ":encoding(UTF-8)" and ":utf8"?"

See "`:encoding(UTF-8)` vs `:encoding(utf8)` vs `:utf8`".
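To make the text-versus-bytes distinction concrete, here is a small sketch that does the encoding step by hand with the core Encode module instead of through a filehandle layer. The helper `hex_bytes` is a hypothetical name used only for display.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: hex dump of a byte string, e.g. "F0 9F 90 B6".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $text  = "\x{1F436}";                # one code point (DOG FACE)
my $bytes = encode('UTF-8', $text);     # encoding: text -> bytes
my $back  = decode('UTF-8', $bytes);    # decoding: bytes -> text

printf "text:  %d character(s)\n", length($text);                # 1
printf "bytes: %s\n", hex_bytes($bytes);                         # F0 9F 90 B6
printf "round trip ok: %s\n", $back eq $text ? 'yes' : 'no';     # yes
```

An `:encoding(UTF-8)` layer on a handle performs the same conversion automatically on every write (and the reverse on every read), so the program itself can keep working in characters.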