in reply to Re: Another Unicode/emoji question
in thread Another Unicode/emoji question

If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark

Even after applying the encoding from Re^3: Another Unicode/emoji question, it is not displaying, so I need to try BOM next...

Any suggestions on the best way to add the BOM? File::BOM and String::BOM appear only to read the files, not generate them. Or is it as simple as writing \x{efbbbf} as the first thing after the HTTP headers?

Re: BOM (was: Re^2: Another Unicode/emoji question)
by pryrt (Abbot) on Dec 24, 2023 at 16:10 UTC
    Or is it as simple as writing \x{efbbbf} as the first thing after the HTTP headers?

    The string my $str = "\x{efbbbf}"; does not contain the BOM character. It contains the single code point U+EFBBBF, which is not a valid Unicode character (AFAIK, Unicode only goes up to U+10FFFF). The string my $str = "\x{feff}"; contains the BOM character.

    If you used the string you suggested, whether in raw mode or with a UTF-8 output encoding, you would not get what you expected:

    C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\x{efbbbf})" | xxd
    Wide character in print at -e line 1.
    00000000: f8bb bbae bf  .....
    C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{efbbbf})" | xxd
    Code point 0xEFBBBF is not Unicode, may not be portable in print at -e line 1.
    00000000: 5c78 7b45 4642 4242 467d  \x{EFBBBF}

    Neither of those outputs the UTF-8 bytes for the BOM U+FEFF character.

    Instead, you need to do one of three things: send the three octets individually in raw mode; use raw mode and manually encode a Perl string into UTF-8 bytes; or use a UTF-8 output encoding and print the U+FEFF character directly:

    C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\xef\xbb\xbf)" | xxd
    00000000: efbb bf  ...
    C:\Users\Peter> perl -MEncode -e "binmode STDOUT, ':raw'; print Encode::encode('UTF-8', qq(\x{feff}));" | xxd
    00000000: efbb bf  ...
    C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{feff})" | xxd
    00000000: efbb bf  ...
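
    In script form, the second approach (raw mode plus manual encoding) might look roughly like the sketch below. The helper name build_response, the text/calendar content type, and the sample calendar text are assumptions made up for illustration; they are not taken from your script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (name and content type are assumptions):
# takes a Perl character string and returns the whole response as raw bytes.
sub build_response {
    my ($text) = @_;

    # The header declares the charset; this part is plain ASCII.
    my $headers = "Content-Type: text/calendar; charset=utf-8\r\n\r\n";

    # Prepend U+FEFF as a character, then encode everything once.
    # In UTF-8, U+FEFF becomes the three octets EF BB BF.
    my $body = encode('UTF-8', "\x{FEFF}" . $text);

    return $headers . $body;
}

# The data is already encoded, so write it as raw bytes
# to avoid encoding it a second time.
binmode STDOUT, ':raw';
print build_response("BEGIN:VCALENDAR\r\nSUMMARY:Party \x{1F389}\r\nEND:VCALENDAR\r\n");
```

    The point of encoding once and printing in raw mode is that the BOM and the rest of the body go through the same conversion, so there is no risk of mixing already-encoded octets with character data.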

    Whether or not that would "work" in your use case is something I don't know. My guess is that it won't help, because anything that uses HTTP headers should be paying attention to the encoding listed in the headers, not requiring a BOM in the message body. Though I guess if it's saving the HTTP message body into a file and then using that file later, maybe the BOM would help. I don't know about that, sorry.

    --
    warning: Windows quoting used in code blocks; swap quotes around if you're on linux

      my guess is that it won't help, because anything that's using HTTP headers should be paying attention to the encoding listed in the headers, and not requiring a BOM in the message body

      This article implies that a BOM is needed. But, I'm not sure if that applies to a URL calendar feed or just an imported file.