beerman has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a UTF8 file and I'm getting the UTF8 BOM, which I don't want. Strange thing is that the BOM is in ISO-8859 representation. In other words it shows up as ï»¿. The Unicode data being written to the file is UTF8 and, as we know, you don't need a BOM with UTF8. So the question is: how can I get Perl to not write the BOM? Here is a section of my program that I think is relevant to this question:

use utf8;
...
...
open (OUT,">:utf8", "$name") or die "cannot open file for writing";
...
print OUT $_;

My environment is Windows XP, cygwin and perl 5.10.1. Any help in printing a UTF8 file with no BOM is much appreciated!

Replies are listed 'Best First'.
Re: Don't want BOM in output file
by Eliya (Vicar) on Oct 14, 2011 at 17:05 UTC
    Strange thing is that the BOM is in ISO-8859 representation. In other words it shows up as ï»¿.

    Actually, those three bytes (EF BB BF) are the UTF-8 encoding of the BOM (there is no ISO-8859 representation of the BOM); ï»¿ is just what those bytes look like when they are displayed as ISO-8859-1 characters.

    As Perl doesn't automatically add a BOM with UTF-8 files (at least I've verified it doesn't on Unix, and AFAIK, Perl doesn't behave differently on Windows in this regard), I suspect the BOM already is in the data you're writing out.  Where does it (the $_ in your case) come from?

    In this case, you could remove the BOM with:  s/\x{feff}//;
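
    For example, a minimal sketch along those lines (handle and file names are placeholders; the input is assumed to be UTF-8 encoded):

    open my $in,  "<:encoding(UTF-8)", "input.txt"  or die "cannot open input: $!";
    open my $out, ">:encoding(UTF-8)", "output.txt" or die "cannot open output: $!";
    while (my $line = <$in>) {
        $line =~ s/\x{feff}//g;   # the decoded BOM is the single character U+FEFF
        print $out $line;
    }
    close $in;
    close $out;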

Re: Don't want BOM in output file
by zentara (Cardinal) on Oct 14, 2011 at 18:06 UTC
Re: Don't want BOM in output file
by anneli (Pilgrim) on Oct 15, 2011 at 09:34 UTC

    Perl won't write the BOM for you; it sounds like it must already be part of your output data. If it's being written in a different representation (in UTF-8), perhaps Perl is taking the BOM bytes that were (mistakenly) read as ISO-8859 characters from the input and translating them into the valid UTF-8 sequence for those characters in the output.

      Yes, I found the problem. The data did have the UTF8 BOM. I didn't notice the BOM until I analyzed the bytes (od -x). The issue was that perl was treating the BOM bytes (EF BB BF) as ISO-8859-1 characters, even though I indicated that my output should be UTF8 (open (OUT, ">:utf8", $name)). The fix was to also open the input with the utf8 encoding. That is, my original statement was open (INPUT, "< $inputfile"), so I changed it to open (INPUT, "<:utf8", "$inputfile"). Thanks to all for the help with this. Once I realized that the input file really had the BOM, the fix was easy.
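
      For reference, the corrected combination boils down to roughly this (same variable names as in my snippet above; the s/// line is Eliya's suggestion from above):

      open (INPUT, "<:utf8", "$inputfile") or die "cannot open file for reading";
      open (OUT,   ">:utf8", "$name")      or die "cannot open file for writing";
      while (<INPUT>) {
          s/^\x{feff}//;    # the BOM now comes in as one character, so it is easy to drop
          print OUT $_;
      }
      close INPUT;
      close OUT;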

        Great! Thanks for reporting your solution back. :)

        I think the issue was solely due to the input not being UTF-8 aware; it thought the BOM was ISO-8859 (i.e. the three characters "ï»¿"); then when you wrote with UTF-8 awareness, they were translated into the appropriate UTF-8 sequence (C3 AF C2 BB C2 BF), which, when read as UTF-8, translates to the codepoints for "ï»¿"!

        I tested with this:

        our $/;
        open(my $in, "<", "myfile");
        open(my $out, ">", "myoutfile");
        my $d = <$in>;
        print $out $d;
        close $out;
        close $in;

        "myfile" has the content:

        0000000: efbb bf68 656c 6c6f 2c20 776f 726c 640a  ...hello, world.

        With the code above, Perl doesn't try to interpret the BOM as a BOM on either reading or writing, and "myoutfile" winds up like this:

        0000000: efbb bf68 656c 6c6f 2c20 776f 726c 64    ...hello, world

        (identical!) If we decide to interpret the input (only) as UTF-8, however, the BOM is interpreted as a UTF-8 sequence, and we get a warning about "Wide character in print" when trying to print it out to a filehandle that doesn't know about UTF-8:

        $ perl test.pl
        Wide character in print at test.pl line 10, <$in> line 1.
        $

        "myoutfile" still has the BOM prepended (is Perl just trying a UTF-8 representation?) in this case. The other notable thing when reading in with "<:utf8" is the value of ord($d): 0xFEFF. If we didn't use utf8, it comes out as 0xEF.

        Using utf8 on both streams causes the BOM to be faithfully read in and written out; and using utf8 only on output writes out the individual bytes, as they would be interpreted in ISO-8859 (i.e. ï»¿), encoded in UTF-8:

        0000000: c3af c2bb c2bf 6865 6c6c 6f2c 2077 6f72  ......hello, wor
        0000010: 6c64                                     ld

        Fun times!
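
        For completeness, the "utf8 only on output" case boils down to roughly this (same file names as above):

        our $/;                                           # slurp, as before
        open(my $in,  "<",      "myfile")    or die $!;   # bytes come in as ISO-8859-1 characters
        open(my $out, ">:utf8", "myoutfile") or die $!;   # ...and each one gets re-encoded as UTF-8
        print $out scalar <$in>;
        close $out;
        close $in;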

Re: Don't want BOM in output file
by Anonymous Monk on Oct 14, 2011 at 18:53 UTC
    From perlio (emphasis mine):
        :utf8
            Declares that the stream accepts perl's *internal* encoding of
            characters. (Which really is UTF-8 on ASCII machines, but is
            UTF-EBCDIC on EBCDIC machines.) ...
    
            Note that this layer does not validate byte sequences. For reading
            input, using ":encoding(utf8)" instead of bare ":utf8" is strongly
            recommended.

    I recommend looking at the utf8::all module, which wraps all these confusing utf8 machinations in one pragma, and allows you to simply use '<' or '>' as the mode when opening text files (see that module's synopsis).
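
    Going by that module's synopsis, a minimal sketch would look something like this (file names are placeholders):

    use utf8::all;                    # open(), @ARGV and the standard handles all become UTF-8

    open my $in,  '<', 'input.txt'  or die $!;   # no explicit :utf8/:encoding layer needed
    open my $out, '>', 'output.txt' or die $!;
    while (<$in>) {
        print $out $_;
    }

    Note that this only takes care of the encoding layers; a BOM that is present in the data still has to be removed by hand (see below).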

      For reading input, using ":encoding(utf8)" instead of bare ":utf8" is strongly recommended.

      While this is correct (for security reasons), it's unlikely to help with the OP's (presumed) problem of getting rid of a BOM in the input data. In other words, :encoding(utf8) (just like :utf8) does not filter out the BOM:

      my $file = "somefile.utf8";

      # create a UTF-8 encoded test file, explicitly adding a BOM
      open my $out, ">:utf8", $file or die $!;
      print $out "\x{feff}foo bär";
      close $out;

      # read it back in
      open my $in, "<:encoding(utf8)", $file or die $!;
      $_ = <$in>;

      use Devel::Peek;
      Dump $_;
      SV = PV(0x793cd0) at 0x7c53e0
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x7c9088 "\357\273\277foo b\303\244r"\0 [UTF8 "\x{feff}foo b\x{e4}r"]
                       ^^^^
        CUR = 11
        LEN = 80
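
      So even with :encoding(utf8), the BOM simply arrives as a regular U+FEFF character, and you still have to strip it explicitly, e.g. by adding this after the readline above:

      s/^\x{feff}//;            # remove the leading BOM character by hand
      print length($_), "\n";   # prints 7 -- just "foo bär", BOM gone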