Re^6: UTF-8 text files with Byte Order Mark

Re^7: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on May 23, 2012 at 17:39 UTC

This is a BOM for UTF-16 Big Endian-encoded files.

You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

$ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
0000000 ef bb bf
0000003

$ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
0000000 fe ff
0000002

$ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
0000000 ff fe
0000002

FEFF	BOM
2B,2F,76,38,2D	BOM encoded using UTF-7
EF,BB,BF	BOM encoded using UTF-8
FE,FF	BOM encoded using UTF-16be
FF,FE	BOM encoded using UTF-16le
00,00,FE,FF	BOM encoded using UTF-32be
FF,FE,00,00	BOM encoded using UTF-32le

So you won't find the bytes FE,FF in a UTF-8 file, but you can find U+FEFF encoded as EF,BB,BF there, just as you can find it encoded as FE,FF in a UTF-16be file.
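
The byte table above can be turned into a small BOM sniffer. This is only a sketch: the name sniff_bom is made up for illustration, and it covers only the UTF-8, UTF-16 and UTF-32 rows (not UTF-7). The four-byte UTF-32 signatures are tried before the two-byte UTF-16 ones because FF,FE,00,00 begins with FF,FE.

```perl
use strict;
use warnings;

# Known signatures, longest first so UTF-32le is tried before UTF-16le.
my @BOMS = (
    [ "UTF-32be" => "\x00\x00\xFE\xFF" ],
    [ "UTF-32le" => "\xFF\xFE\x00\x00" ],
    [ "UTF-8"    => "\xEF\xBB\xBF"     ],
    [ "UTF-16be" => "\xFE\xFF"         ],
    [ "UTF-16le" => "\xFF\xFE"         ],
);

# Hypothetical helper: takes raw (undecoded) bytes and returns the name of
# the encoding whose BOM they start with, or undef if none matches.
sub sniff_bom {
    my ($bytes) = @_;
    for my $pair (@BOMS) {
        my ($name, $sig) = @$pair;
        return $name if substr($bytes, 0, length $sig) eq $sig;
    }
    return undef;
}

printf "%s\n", sniff_bom("\xEF\xBB\xBFhello") // "none";
printf "%s\n", sniff_bom("hello") // "none";
```

Note that a match is only a hint. A UTF-16le file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32le BOM, so the sniffer cannot tell those two apart.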


Re^8: UTF-8 text files with Byte Order Mark

by silentq (Novice) on May 27, 2013 at 13:44 UTC

Re^9: UTF-8 text files with Byte Order Mark

by ikegami (Patriarch) on May 29, 2013 at 20:18 UTC

The last one is the correct one. It will remove the BOM after it's been decoded.
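
For readers who can't see the commands being discussed, here is a minimal sketch of the "decode first, then remove" order using only the core Encode module. The helper name decode_utf8_strip_bom and the sample string are assumptions for illustration, not the code from the post above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode UTF-8 bytes to characters, then drop a
# leading U+FEFF if one is present.
sub decode_utf8_strip_bom {
    my ($bytes) = @_;
    my $text = decode("UTF-8", $bytes);
    $text =~ s/\A\x{FEFF}//;    # the BOM is now one character, not three bytes
    return $text;
}

# Sample input: a BOM followed by "hello", encoded as UTF-8.
my $bytes = encode("UTF-8", "\x{FEFF}hello");
my $text  = decode_utf8_strip_bom($bytes);

# prints "8 bytes in, 5 characters out"
printf "%d bytes in, %d characters out\n", length($bytes), length($text);
```

The same substitution works on the first line read from a handle opened with '<:encoding(UTF-8)', since by then the layer has already decoded the three bytes into the single character U+FEFF.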

Re^9: UTF-8 text files with Byte Order Mark

by Anonymous Monk on May 29, 2013 at 08:19 UTC

I'm trying my best to understand this thread, but I'm having difficulty.

Please stop trying, there is nothing for you here. Read perlunitut (Tutorials: Unicode in Perl) and use the :via(File::BOM) layer.

I've tried deleting it using all these commands, none of which works:

Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you.

If you've got raw data you want to share, you can use:

perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; "  AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl
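
If Data::Dump and File::Slurp aren't installed, something like the following core-only sketch shows the leading bytes of a file as hex, which is usually enough to see whether a BOM is there. The helper name leading_bytes_hex and the choice of four bytes are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $count bytes of $path as
# space-separated hex, reading the file in raw (undecoded) mode.
sub leading_bytes_hex {
    my ($path, $count) = @_;
    open my $fh, '<:raw', $path or die "Can't open $path: $!";
    my $buf = '';
    defined read($fh, $buf, $count) or die "Can't read $path: $!";
    close $fh;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buf;
}

# Usage: perl script.pl AnyKindOfInputFile
printf "%s\n", leading_bytes_hex($ARGV[0], 4) if @ARGV;
```

A UTF-8 file with a BOM will show ef bb bf as its first three bytes.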

The different ways a BOM can look:

$ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
{
  # tied Readonly::Hash
  "iso-10646-1" => "\xFE\xFF",
  "UCS-2"       => "\xFE\xFF",
  "UTF-16BE"    => "\xFE\xFF",
  "UTF-16LE"    => "\xFF\xFE",
  "UTF-32BE"    => "\0\0\xFE\xFF",
  "UTF-32LE"    => "\xFF\xFE\0\0",
  "UTF-8"       => "\xEF\xBB\xBF",
  "utf8"        => "\xEF\xBB\xBF",
}
