in reply to Re^2: Another Unicode/emoji question
in thread Another Unicode/emoji question

You forgot to encode.

Simple way to fix:

use open ":std", ":encoding(UTF-8)";

Re^4: Another Unicode/emoji question
by Bod (Parson) on Dec 23, 2023 at 11:47 UTC
    You forgot to encode

    I didn't forget...I didn't know I needed to...

    Simple way to fix

    The open pragma is new to me...thanks for bringing it to my attention.

    I've been reading the documentation and an ikegami answer on SO and I am unsure what use open ":std", ":encoding(UTF-8)"; is doing that isn't already there from use utf8; and Content-type: text/calendar; charset=utf-8;

    ":std" applies the layer to the global filehandles STDIN, STDOUT and STDERR - is that right?

    But what is the difference between ":encoding(UTF-8)" and ":utf8"? I gather utf8 is a Perl-specific implementation of UTF-8, but how does this difference manifest in practical terms?

    UPDATE:
    Somehow I missed use open in kcott's excellent answer here: Re: Another Unicode/emoji question

      "... and I am unsure what use open ":std", ":encoding(UTF-8)"; is doing that isn't already there from use utf8; and ..." [my emphasis]

      The utf8 pragma tells Perl that your source code contains Unicode characters. It has nothing to do with input or output. See the emboldened text near the start of the DESCRIPTION in that documentation:

      "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."

      In my code example which you referenced, there is the statement "say q{🐶 = }, "🐶";". That statement contains Unicode characters and therefore I need "use utf8;". The rest of that code only contains 7-bit ASCII characters. If I were to remove "say q{🐶 = }, "🐶";", I wouldn't need "use utf8;".

      $ perl -E '
          use strict;
          use warnings;
          use open OUT => qw{:encoding(UTF-8) :std};
          say q{\x{1f436} = }, "\x{1f436}";
          say q{\x{1F436} = }, "\x{1F436}";
          say q{\N{DOG FACE} = }, "\N{DOG FACE}";
      '
      \x{1f436} = 🐶
      \x{1F436} = 🐶
      \N{DOG FACE} = 🐶
      

      I don't know if it's difficult to get your head around (and I certainly don't mean to be patronising or condescending) but "\N{DOG FACE}" is part of the source code which contains twelve 7-bit ASCII characters: you do not need the utf8 pragma for this.
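
      To make the distinction between source text and resolved character concrete, here is a minimal sketch (not part of the code referenced above; the variable names are only illustrative). It compares the single-quoted form, which keeps the escape as written, with the double-quoted form, which Perl resolves to one character. The source file itself stays pure ASCII, so no utf8 pragma is involved.

      ```perl
      use strict;
      use warnings;
      use charnames ':full';   # explicit for older perls; loaded automatically since 5.16

      # Single-quoted: the escape is not interpolated, so this is the literal source text.
      my $source_text = q{\N{DOG FACE}};

      # Double-quoted: Perl resolves the named escape to a single character, U+1F436.
      my $character = "\N{DOG FACE}";

      printf "source text: %d characters\n", length($source_text);
      printf "resolved:    %d character, code point U+%X\n",
          length($character), ord($character);
      ```

      This should report 12 characters for the source text and 1 character (U+1F436) for the resolved string. Nothing here prints the character itself, so no output layer is needed either.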

      The "\N{DOG FACE}" resolves to a Unicode character in the output and the open pragma handles that. Here's what happens if I omit the open pragma:

      $ perl -E '
          use strict;
          use warnings;
          say q{\x{1f436} = }, "\x{1f436}";
          say q{\x{1F436} = }, "\x{1F436}";
          say q{\N{DOG FACE} = }, "\N{DOG FACE}";
      '
      Wide character in say at -e line 4.
      \x{1f436} = 🐶
      Wide character in say at -e line 5.
      \x{1F436} = 🐶
      Wide character in say at -e line 6.
      \N{DOG FACE} = 🐶
      

      And note that all of those "Wide character" warnings remain if I use the utf8 pragma:

      $ perl -E '
          use strict;
          use warnings;
          use utf8;
          say q{\x{1f436} = }, "\x{1f436}";
          say q{\x{1F436} = }, "\x{1F436}";
          say q{\N{DOG FACE} = }, "\N{DOG FACE}";
          say q{🐶 = }, "🐶";
      '
      Wide character in say at -e line 5.
      \x{1f436} = 🐶
      Wide character in say at -e line 6.
      \x{1F436} = 🐶
      Wide character in say at -e line 7.
      \N{DOG FACE} = 🐶
      Wide character in say at -e line 8.
      Wide character in say at -e line 8.
      🐶 = 🐶
      

      As further examples, see "uparse - Parse Unicode strings", and its improved version "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]", both of which read and write Unicode characters and use the open pragma, but neither contains Unicode characters in the source code so neither uses the utf8 pragma.

      [Aside: I'm going away for Christmas and won't be touching any computer equipment. Christmas Day is less than two hours away in my time zone. If you have further questions regarding this, I won't get to them for a few days; although, someone else might.]

      — Ken

      File handles expect bytes by default.

      You're providing Unicode Code Points.

      Converting text to bytes is called encoding. You told the other end you were providing UTF-8, but you never applied this encoding.
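
      As a rough illustration of what "applying the encoding" means, the sketch below encodes by hand with the core Encode module instead of using an I/O layer. It is only one way to do it; in a real script the open pragma or binmode with ":encoding(UTF-8)" does the same job for you on every write. The variable names are illustrative.

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);

      my $text  = "\x{1F436}";              # one Unicode code point (DOG FACE)
      my $bytes = encode('UTF-8', $text);   # its UTF-8 encoding: F0 9F 90 B6

      printf "characters: %d\n", length($text);
      printf "bytes:      %d (%s)\n", length($bytes),
          join ' ', map { sprintf '%02X', ord } split //, $bytes;

      # The handle expects bytes and now receives bytes, so printing the
      # encoded string raises no "Wide character" warning.
      binmode(STDOUT, ':raw');
      print $bytes, "\n";
      ```

      The one character becomes four bytes. Printing $text instead of $bytes on the same raw handle would bring the "Wide character" warning back.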

      But what is the difference between ":encoding(UTF-8)" and ":utf8"?

      `:encoding(UTF-8)` vs `:encoding(utf8)` vs `:utf8`
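
      A small sketch of where the names diverge, based on my reading of the Encode documentation rather than on anything stated in this thread: Encode treats "UTF-8" (with the hyphen) as the strict, standards-conforming encoding and "utf8" as Perl's more permissive internal variant. Asking Encode for the canonical names shows that they are two different encodings.

      ```perl
      use strict;
      use warnings;
      use Encode qw(find_encoding);

      # Canonical names as reported by Encode.
      my $strict_name = find_encoding('UTF-8')->name;   # the strict encoding
      my $lax_name    = find_encoding('utf8')->name;    # Perl's lax variant

      print "UTF-8 resolves to: $strict_name\n";
      print "utf8  resolves to: $lax_name\n";
      ```

      On the perls I am aware of this prints utf-8-strict and utf8 respectively. As I understand it, the bare :utf8 layer goes further still and does not run data through Encode at all, so for input from outside your program :encoding(UTF-8) is the safer choice.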