in reply to Another Unicode/emoji question

Aside from the answer about the correct Unicode code point by kcott, it seems some calendar systems are more forgiving than others when it comes to encoding.

If Unicode still causes problems, check (with a hex editor) whether the file you generate has a byte order mark. I haven't played with ICS calendar stuff in many, MANY years, so I'm just guessing. But at least a quick Google search suggests that a BOM might be required.
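If no hex editor is at hand, the same check can be done from Perl. This is only a minimal sketch: the helper name has_utf8_bom is made up for illustration, and it only looks for the UTF-8 form of the BOM (the bytes EF BB BF) at the very start of a file you have saved locally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the UTF-8 BOM bytes EF BB BF.
sub has_utf8_bom {
    my ($path) = @_;
    # Open in raw mode so we see bytes, not decoded characters.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $read = read($fh, my $head, 3);
    close $fh;
    return defined $read && $read == 3 && $head eq "\xEF\xBB\xBF";
}

# Report on every file named on the command line.
for my $path (@ARGV) {
    printf "%s: %s\n", $path, has_utf8_bom($path) ? 'has a UTF-8 BOM' : 'no UTF-8 BOM';
}
```

Saved under any name you like, it takes the files to check as command-line arguments.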


BOM (was: Re^2: Another Unicode/emoji question)
by Bod (Parson) on Dec 23, 2023 at 16:46 UTC
    If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark

    Even after applying the encoding from Re^3: Another Unicode/emoji question, it is not displaying, so I need to try BOM next...

    Any suggestions on the best way to add the BOM? File::BOM and String::BOM appear only to read the files, not generate them. Or is it as simple as writing \x{efbbbf} as the first thing after the HTTP headers?

      Or is it as simple as writing \x{efbbbf} as the first thing after the HTTP headers?

      The string my $str = "\x{efbbbf}"; does not contain the BOM character; it contains U+EFBBBF, which is not a valid Unicode character (AFAIK, Unicode only goes up to U+10FFFF). The string my $str = "\x{feff}"; contains the BOM character.

      If you did use the string you suggested, whether with raw mode or with UTF-8 output encoding, you will not get what you thought:

      C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\x{efbbbf})" | xxd
      Wide character in print at -e line 1.
      00000000: f8bb bbae bf                             .....

      C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{efbbbf})" | xxd
      Code point 0xEFBBBF is not Unicode, may not be portable in print at -e line 1.
      00000000: 5c78 7b45 4642 4242 467d                 \x{EFBBBF}

      Neither of those outputs the UTF-8 bytes for the BOM U+FEFF character.

      Instead, you either need to manually send the three octets separately in raw mode, or use raw mode and manually encode from a perl string into UTF-8 bytes, or use UTF-8 output encoding and send the U+FEFF character from the string directly:

      C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\xef\xbb\xbf)" | xxd
      00000000: efbb bf                                  ...

      C:\Users\Peter> perl -MEncode -e "binmode STDOUT, ':raw'; print Encode::encode('UTF-8', qq(\x{feff}));" | xxd
      00000000: efbb bf                                  ...

      C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{feff})" | xxd
      00000000: efbb bf                                  ...
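      In script form, the second variant (raw mode plus manual encoding) might look roughly like the sketch below. Everything calendar-related in it is an assumption made up for illustration: the sub name ics_body_with_bom, the PRODID, the UID, the dates and the CGI-style header line are not taken from the real feed. It only shows where the U+FEFF character would go, namely at the start of the body, after the HTTP headers.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a tiny ICS body as a character string,
# prefix it with U+FEFF, and return the UTF-8 encoded bytes.
sub ics_body_with_bom {
    my ($summary) = @_;
    my $ics = join "\r\n",
        'BEGIN:VCALENDAR',
        'VERSION:2.0',
        'PRODID:-//example//bom-sketch//EN',      # placeholder value
        'BEGIN:VEVENT',
        'UID:bom-sketch-1@example.invalid',       # placeholder value
        'DTSTAMP:20231223T164600Z',
        'DTSTART:20231224T090000Z',
        "SUMMARY:$summary",
        'END:VEVENT',
        'END:VCALENDAR',
        '';
    # U+FEFF encodes to the three octets EF BB BF.
    return encode('UTF-8', "\x{FEFF}" . $ics);
}

# Raw mode: we print already-encoded bytes, so no output layer should re-encode them.
binmode STDOUT, ':raw';
print "Content-Type: text/calendar; charset=utf-8\r\n\r\n";
print ics_body_with_bom("Walk the dog \x{1F436}");
```

      Piping the output through xxd should show ef bb bf as the first bytes after the blank line that ends the headers. Whether the calendar client actually wants the BOM there is a separate question.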

      Whether or not that would "work" in your use-case is something I don't know: my guess is that it won't help, because anything that's using HTTP headers should be paying attention to the encoding listed in the headers, and not requiring a BOM in the message body. Though I guess if it's saving the HTTP message body into a file, and then later using that file, maybe the BOM would help. I don't know on that, sorry.

      --
      warning: Windows quoting used in code blocks; swap quotes around if you're on linux

        my guess is that it won't help, because anything that's using HTTP headers should be paying attention to the encoding listed in the headers, and not requiring a BOM in the message body

        This article implies that a BOM is needed. But, I'm not sure if that applies to a URL calendar feed or just an imported file.

Re^2: Another Unicode/emoji question
by Bod (Parson) on Dec 22, 2023 at 15:30 UTC
    If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark

    Thanks. My URL doesn't have a BOM and I hadn't considered the possibility that it might need one!

    I have changed the code as suggested by kcott and I'll now wait until tomorrow for Google to hit my calendar endpoint and see if it works. If not, I'll take another day and look into whether a BOM is the issue. Debugging this is taking forever, as I have to wait for Google to refresh, which it does approximately once per day.

Re^2: Another Unicode/emoji question
by Bod (Parson) on Dec 23, 2023 at 09:51 UTC

    It looks like a BOM is the next thing to try...

    Google has updated the calendar and displayed \x{1f436} instead of 🐶