Re^7: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on May 23, 2012 at 17:39 UTC

    This is a BOM for UTF-16 Big Endian-encoded files.

    You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

    $ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
    0000000 ef bb bf
    0000003
    $ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
    0000000 fe ff
    0000002
    $ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
    0000000 ff fe
    0000002

    FEFF             BOM
    2B,2F,76,38,2D   BOM encoded using UTF-7
    EF,BB,BF         BOM encoded using UTF-8
    FE,FF            BOM encoded using UTF-16be
    FF,FE            BOM encoded using UTF-16le
    00,00,FE,FF      BOM encoded using UTF-32be
    FF,FE,00,00      BOM encoded using UTF-32le

    So you won't find FE,FF in a UTF-8 file, but just like in a UTF-16be file, you can find an encoded FEFF in a UTF-8 file.
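
    As a rough illustration of the table above, the leading bytes of a file can be compared against those signatures. This is only a sketch: the helper name sniff_bom is made up for this example, UTF-7 is left out, and a UTF-16le file whose first character after the BOM is U+0000 would be misreported as UTF-32le.

```perl
use strict;
use warnings;

# BOM byte signatures taken from the table above. The 4-byte ones come
# first so that UTF-32le (FF FE 00 00) is tested before UTF-16le (FF FE).
my @BOMS = (
    [ 'UTF-32be' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32le' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16be' => "\xFE\xFF"         ],
    [ 'UTF-16le' => "\xFF\xFE"         ],
);

# Hypothetical helper: takes a string of raw (undecoded) bytes and returns
# the encoding name suggested by a leading BOM, or undef if none matches.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $signature) = @$entry;
        return $name if index($bytes, $signature) == 0;
    }
    return;
}
```

    The input has to be the raw bytes (for example, read from a handle opened with the :raw layer), not text that has already been decoded.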

      I'm trying my best to understand this thread, but I'm having difficulty.
      I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

      s/chr(0xEFBBBF)//; #remove Byte Order Mark
      s/\x{EFBBBF}//;
      s/^chr(0xFEFF)//;
      s/^\x{FEFF}//;

      Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
      But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to reference in order to make Perl recognize the \x{NNNN} format. Any suggestions?
      Thanks,
        The last one is the correct one. It will remove the BOM after it's been decoded.
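
        A minimal sketch of what "after it's been decoded" means, assuming the data arrives as raw UTF-8 bytes (the sample string is invented for the example). Until the bytes are decoded, the string holds the three bytes EF BB BF rather than the single character U+FEFF, so s/^\x{FEFF}// has nothing to match. The first two attempts fail for other reasons: chr(...) is not evaluated inside a regex, and \x{EFBBBF} names one code point, not three bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as Notepad might write them: a UTF-8 encoded BOM, then "abc".
my $bytes = "\xEF\xBB\xBFabc";

# Decode first: the three bytes EF BB BF become the one character U+FEFF.
my $text = decode('UTF-8', $bytes);

# Now the fourth pattern from the question matches and removes the BOM.
$text =~ s/^\x{FEFF}//;

print length($text), "\n";   # prints 3
```

        If the string is deliberately kept as undecoded bytes, the equivalent would be s/^\xEF\xBB\xBF//.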

        I'm trying my best to understand this thread, but I'm having difficulty.

        Please stop trying, there is nothing for you here. Read perlunitut (Tutorials/perlunitut: Unicode in Perl) and use :via(File::BOM)

        I've tried deleting it using all these commands, none of which works:

        Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you.

        If you've got raw data you want to share you can use

        perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl

        The different ways BOM can look

        $ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
        {
          # tied Readonly::Hash
          "iso-10646-1" => "\xFE\xFF",
          "UCS-2"       => "\xFE\xFF",
          "UTF-16BE"    => "\xFE\xFF",
          "UTF-16LE"    => "\xFF\xFE",
          "UTF-32BE"    => "\0\0\xFE\xFF",
          "UTF-32LE"    => "\xFF\xFE\0\0",
          "UTF-8"       => "\xEF\xBB\xBF",
          "utf8"        => "\xEF\xBB\xBF",
        }