in reply to Re^4: Unicode and text files
in thread Unicode and text files

is there any way to automagically determine what encoding a file is?

That's precisely what the BOM ("byte order mark") is for. If you don't specify a byte order when creating a file, Perl writes a BOM for you; if you do specify one (utf16le or utf16be), the file will be "BOM-less". Files created without an explicit byte order can then be read back using plain :encoding(utf16):

$ /usr/bin/perl
use strict;
use warnings;

my $c = 'a';
my $fd;

open $fd, '>:encoding(utf16le)', 'foo-le' or die "open: $!";
print $fd $c;
close $fd;

open $fd, '>:encoding(utf16be)', 'foo-be' or die "open: $!";
print $fd $c;
close $fd;

open $fd, '>:encoding(utf16)', 'foo' or die "open: $!";
print $fd $c;
close $fd;
__END__
$ xxd foo-le
0000000: 6100  a.
$ xxd foo-be
0000000: 0061  .a
$ xxd foo
0000000: feff 0061  ...a
$ /usr/bin/perl
open my $fd, '<:encoding(utf16)', 'foo' or die "open: $!";
print while <$fd>;
close $fd;
__END__
a

Update: Of course, I realized after clicking "Create" that I didn't really answer your actual question :^). Well, if files don't have a BOM, you can only guess their encoding or brute-force it. Or add a BOM to them ;^).
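For the files that do have a BOM, checking for it yourself is just a matter of looking at the first few bytes. Here is a minimal sketch, not a polished solution: sniff_bom is a made-up helper name, and the byte patterns are the standard BOMs for UTF-8, UTF-16 and UTF-32. It returns undef when no BOM is found, which is the "you can only guess" case.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): return an encoding name based
# on the BOM at the start of the file, or undef if no BOM is present.
sub sniff_bom {
    my ($path) = @_;

    # Read raw bytes, with no decoding layer in the way.
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $n = read $fh, my $head, 4;
    die "read $path: $!" unless defined $n;
    close $fh;

    # Test the 4-byte BOMs first: the UTF-32LE BOM starts with the same
    # two bytes (FF FE) as the UTF-16LE one.
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;

    return undef;    # no BOM: the caller has to guess
}
```

With the three files created above, sniff_bom('foo') should report UTF-16BE, while 'foo-le' and 'foo-be' come back undef because they were written with an explicit byte order. One caveat: a UTF-16LE file whose first character is U+0000 looks exactly like a UTF-32LE BOM, so treat the answer as a strong hint rather than proof.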


--
David Serrano

Re^6: Unicode and text files
by dirtdart (Beadle) on Oct 12, 2006 at 19:53 UTC
    Thank you. That actually does help. It at least gives me a direction in which to look. I had previously been unaware of the BOM. That would also explain the odd characters that show up when I email myself a log file, but not when I open it in a text editor.