in reply to Re^6: UTF-8 text files with Byte Order Mark
in thread UTF-8 text files with Byte Order Mark

This is a BOM for UTF-16 Big Endian-encoded files.

You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

$ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
0000000 ef bb bf
0000003

$ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
0000000 fe ff
0000002

$ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
0000000 ff fe
0000002

FEFF              BOM
2B,2F,76,38,2D    BOM encoded using UTF-7
EF,BB,BF          BOM encoded using UTF-8
FE,FF             BOM encoded using UTF-16be
FF,FE             BOM encoded using UTF-16le
00,00,FE,FF       BOM encoded using UTF-32be
FF,FE,00,00       BOM encoded using UTF-32le
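
The table can also be read in the other direction: given the leading bytes of a file, work out which encoding of U+FEFF they match. Here is a minimal sketch of that idea using only the core Encode module. The helper name detect_bom is made up for illustration, UTF-7 is left out, and a UTF-16le file whose first character after the BOM is U+0000 would be reported as UTF-32le, so treat this as a rough guess rather than a reliable detector.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the byte sequence of the BOM under each encoding by encoding U+FEFF.
my $bom_char  = chr(0xFEFF);
my @encodings = qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE);
my %bom_for   = map { $_ => encode($_, $bom_char) } @encodings;

# detect_bom (hypothetical helper): return the name of the encoding whose
# BOM bytes start the given byte string, or nothing if no BOM matches.
sub detect_bom {
    my ($bytes) = @_;

    # Try the longest BOMs first, so that FF,FE,00,00 (UTF-32LE) is not
    # reported as FF,FE (UTF-16LE).
    my @candidates = sort {
        length($bom_for{$b}) <=> length($bom_for{$a}) || $a cmp $b
    } keys %bom_for;

    for my $enc (@candidates) {
        return $enc if index($bytes, $bom_for{$enc}) == 0;
    }
    return;
}

# Print the same table as above, generated rather than typed in.
for my $enc (@encodings) {
    my $hex = join ',', map { sprintf '%02X', $_ } unpack 'C*', $bom_for{$enc};
    printf "%-12s BOM encoded using %s\n", $hex, $enc;
}
```

For example, detect_bom("\xEF\xBB\xBFhello") returns "UTF-8", and a string with no BOM at the start returns nothing.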

So you won't find FE,FF in a UTF-8 file, but just like in a UTF-16be file, you can find an encoded FEFF in a UTF-8 file.

Re^8: UTF-8 text files with Byte Order Mark
by silentq (Novice) on May 27, 2013 at 13:44 UTC
    I'm trying my best to understand this thread, but I'm having difficulty.
    I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

    s/chr(0xEFBBBF)//; #remove Byte Order Mark
    s/\x{EFBBBF}//;
    s/^chr(0xFEFF)//;
    s/^\x{FEFF}//;

    Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
    But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to reference in order to make Perl recognize the \x{NNNN} format. Any suggestions?
    Thanks,
      The last one is the correct one. It will remove the BOM after it's been decoded.
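
To make the difference concrete, here is a small sketch using only core modules. The sample text and the temporary file are made up for the demonstration. It shows the two cases that work: matching the three bytes EF,BB,BF on raw data, and matching the single character U+FEFF on decoded text. As for the other attempts in the question: inside a regex, chr(...) is not a function call but the literal letters "chr" followed by a group, and \x{EFBBBF} asks for one character with code point 0xEFBBBF, not for three bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Simulate the file contents: U+FEFF followed by some text, encoded as UTF-8.
my $raw = encode('UTF-8', "\x{FEFF}" . "caf\x{E9}\n");

# Case 1: still raw bytes, so the BOM is the three bytes EF BB BF.
(my $bytes = $raw) =~ s/^\xEF\xBB\xBF//;

# Case 2: decoded text, so the BOM is the single character U+FEFF.
my $text = decode('UTF-8', $raw);
$text =~ s/^\x{FEFF}//;

# Case 3: the same as case 2, but letting an I/O layer do the decoding.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} $raw;
close $out or die "Cannot close $path: $!";

open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $first_line = <$in>;
close $in;
$first_line =~ s/^\x{FEFF}//;    # only the first line can carry the BOM

printf "bytes left: %d, characters left: %d\n", length $bytes, length $text;
```

The byte string keeps 6 bytes (the "é" takes two in UTF-8) while the decoded string keeps 5 characters, which is a quick way to check which kind of data a variable holds.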

      I'm trying my best to understand this thread, but I'm having difficulty.

      Please stop trying, there is nothing for you here. Read Tutorials/perlunitut: Unicode in Perl, perlunitut, and use :via(File::BOM)

      I've tried deleting it using all these commands, none of which works:

      Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you

      If you've got raw data you want to share you can use

      perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl
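
If Data::Dump and File::Slurp are not installed, a rough core-only alternative is to read the first few bytes in :raw mode and print them as hex. This is only a sketch: leading_bytes_hex is a made-up helper name, and the temporary file stands in for whatever input file you want to inspect.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# leading_bytes_hex (hypothetical helper): return the first $count bytes
# of the file at $path as space-separated lowercase hex, without decoding.
sub leading_bytes_hex {
    my ($path, $count) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $buf, $count);
    defined $got or die "Cannot read $path: $!";
    close $fh;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buf;
}

# Demo file: a UTF-8 BOM followed by "hello".
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} "\xEF\xBB\xBFhello\n";
close $out or die "Cannot close $path: $!";

print leading_bytes_hex($path, 4), "\n";    # prints "ef bb bf 68"
```

A file that starts with ef bb bf has the UTF-8 encoded BOM; compare the output against the table of encodings to identify the others.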

      The different ways BOM can look

      $ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
      {
        # tied Readonly::Hash
        "iso-10646-1" => "\xFE\xFF",
        "UCS-2"       => "\xFE\xFF",
        "UTF-16BE"    => "\xFE\xFF",
        "UTF-16LE"    => "\xFF\xFE",
        "UTF-32BE"    => "\0\0\xFE\xFF",
        "UTF-32LE"    => "\xFF\xFE\0\0",
        "UTF-8"       => "\xEF\xBB\xBF",
        "utf8"        => "\xEF\xBB\xBF",
      }