Re^7: UTF-8 text files with Byte Order Mark

"This is a BOM for UTF-16 Big Endian-encoded files."

You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

$ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
0000000 ef bb bf
0000003

$ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
0000000 fe ff
0000002

$ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
0000000 ff fe
0000002

FEFF	BOM
2B,2F,76,38,2D	BOM encoded using UTF-7
EF,BB,BF	BOM encoded using UTF-8
FE,FF	BOM encoded using UTF-16be
FF,FE	BOM encoded using UTF-16le
00,00,FE,FF	BOM encoded using UTF-32be
FF,FE,00,00	BOM encoded using UTF-32le

So you won't find the bytes FE,FF in a UTF-8 file, but you can find FEFF encoded as EF,BB,BF in a UTF-8 file, just as you can find it encoded as FE,FF in a UTF-16be file.
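As a rough sketch, most of the table above can be regenerated with the same Encode call used in the one-liners. The helper name `bom_hex` is made up for illustration, and UTF-7 is left out to keep the example to the encodings already demonstrated plus UTF-32.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode U+FEFF with the named encoding and return
# the resulting bytes as comma-separated hex, in the table's notation.
sub bom_hex {
    my ($encoding) = @_;
    my $bom   = chr(0xFEFF);
    my $bytes = encode($encoding, $bom);
    return join ',', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Print one line per encoding: its name, then the encoded BOM.
for my $encoding (qw(UTF-8 UTF-16be UTF-16le UTF-32be UTF-32le)) {
    printf "%-9s %s\n", $encoding, bom_hex($encoding);
}
```

The point is that a single character, U+FEFF, goes in every time; only the encoding, and therefore the byte sequence, changes.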


Replies are listed 'Best First'.
Re^8: UTF-8 text files with Byte Order Mark by silentq (Novice) on May 27, 2013 at 13:44 UTC
I'm trying my best to understand this thread, but I'm having difficulty. I'm dealing with the same issue, where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

s/chr(0xEFBBBF)//; #remove Byte Order Mark
s/\x{EFBBBF}//;
s/^chr(0xFEFF)//;
s/^\x{FEFF}//;

Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works. But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to reference in order to make Perl recognize the \x{NNNN} format. Any suggestions?

Thanks,
Re^9: UTF-8 text files with Byte Order Mark by ikegami (Patriarch) on May 29, 2013 at 20:18 UTC
The last one (s/^\x{FEFF}//) is the correct one. It will remove the BOM, provided the text has already been decoded.
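A minimal sketch of what "after it's been decoded" means, assuming the file's bytes have been read raw into a string. The helper name `decode_and_strip_bom` and the sample data are made up for illustration. Two of the other attempts cannot work because `chr(...)` is not called inside a regex (the pattern matches that text literally), and the third fails because `\x{EFBBBF}` names the single code point U+EFBBBF rather than the three bytes EF,BB,BF.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw UTF-8 bytes into characters, then
# remove a leading U+FEFF if one is present.
sub decode_and_strip_bom {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);  # EF BB BF becomes the one character U+FEFF
    $text =~ s/^\x{FEFF}//;              # now the substitution has something to match
    return $text;
}

# Sample data (an assumption): the bytes an editor would write for
# "hello" with a UTF-8 encoded BOM in front.
my $raw  = "\xEF\xBB\xBFhello\n";
my $text = decode_and_strip_bom($raw);

# Encode again before printing so the output is bytes, not characters.
print encode('UTF-8', $text);
```

If the data is still undecoded bytes when the substitution runs, the pattern has to match the encoded form instead, for example `s/^\xEF\xBB\xBF//`.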
Re^9: UTF-8 text files with Byte Order Mark by Anonymous Monk on May 29, 2013 at 08:19 UTC
"I'm trying my best to understand this thread, but I'm having difficulty."

Please stop trying, there is nothing for you here. Read perlunitut (Tutorials: "perlunitut: Unicode in Perl") and use :via(File::BOM).

"I've tried deleting it using all these commands, none of which works:"

Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you.

If you've got raw data you want to share, you can use

$ perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl

The different ways a BOM can look:

$ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
{
  # tied Readonly::Hash
  "iso-10646-1" => "\xFE\xFF",
  "UCS-2"       => "\xFE\xFF",
  "UTF-16BE"    => "\xFE\xFF",
  "UTF-16LE"    => "\xFF\xFE",
  "UTF-32BE"    => "\0\0\xFE\xFF",
  "UTF-32LE"    => "\xFF\xFE\0\0",
  "UTF-8"       => "\xEF\xBB\xBF",
  "utf8"        => "\xEF\xBB\xBF",
}