Re^5: Unicode and text files

is there any way to automagically determine what encoding a file is?

That's precisely what the BOM ("byte order mark") is for. If you don't specify a byte order when creating a file (plain utf16), Perl writes a BOM for you; if you do specify one (utf16le or utf16be), the file will be "BOM-less". Files created with a BOM can be read back with plain :encoding(utf16), which uses the BOM to pick the byte order:

$ /usr/bin/perl
use strict;
use warnings;

my $c = 'a';
my $fd;

open $fd, '>:encoding(utf16le)', 'foo-le' or die "open: $!";
print $fd $c;
close $fd;

open $fd, '>:encoding(utf16be)', 'foo-be' or die "open: $!";
print $fd $c;
close $fd;

open $fd, '>:encoding(utf16)', 'foo' or die "open: $!";
print $fd $c;
close $fd;
__END__
$ xxd foo-le
0000000: 6100                                     a.
$ xxd foo-be
0000000: 0061                                     .a
$ xxd foo
0000000: feff 0061                                ...a
$ /usr/bin/perl
open my $fd, '<:encoding(utf16)', 'foo' or die "open: $!";
print while <$fd>;
close $fd;
__END__
a
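
If you want to see what the BOM check amounts to, it can be done by hand on the raw bytes. This is only a minimal sketch: sniff_bom is a made-up helper name, not something from Encode or PerlIO, and it only knows the common Unicode BOMs.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): look at the first bytes of
# a file and return the encoding its BOM announces, or nothing if the file
# does not start with a BOM this sketch knows about.
sub sniff_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $head = '';
    read $fh, $head, 4;    # a short read is fine for tiny files
    close $fh;

    # Test the 4-byte UTF-32 marks before the 2-byte UTF-16 ones, because
    # FF FE 00 00 also begins with the UTF-16LE mark FF FE. (That byte
    # sequence is ambiguous in principle; this sketch assumes UTF-32LE.)
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;
}
```

For the three files above, sniff_bom('foo') returns 'UTF-16BE' (the file starts with fe ff), while foo-le and foo-be have no BOM and give nothing back.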

Update: Of course, I realized after clicking "Create" that I didn't really answer your actual question :^). Well, if files don't have a BOM, you can only guess or brute-force them. Or add a BOM to them ;^).
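
As an illustration of the "guess" part, here is one possible heuristic for BOM-less UTF-16. It is a rough sketch under a big assumption: the text is mostly ASCII/Latin-1, so one byte of almost every 16-bit unit is NUL. guess_utf16_order is a made-up name, and the 2:1 threshold is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the byte order of BOM-less UTF-16 data by
# counting NUL bytes at even and odd offsets. In mostly-ASCII text the
# NULs are the high bytes: offsets 0, 2, 4... for big-endian and
# 1, 3, 5... for little-endian. Returns nothing when there is no clear
# signal (for example, data with no NUL bytes at all).
sub guess_utf16_order {
    my ($bytes) = @_;
    my ($even, $odd) = (0, 0);
    for my $i (0 .. length($bytes) - 1) {
        next unless substr($bytes, $i, 1) eq "\0";
        if ($i % 2) { $odd++ } else { $even++ }
    }
    return 'UTF-16BE' if $even > 2 * $odd;
    return 'UTF-16LE' if $odd > 2 * $even;
    return;
}

# The byte sequences from the xxd dumps of foo-le and foo-be above.
print guess_utf16_order("a\0"), "\n";    # prints UTF-16LE
print guess_utf16_order("\0a"), "\n";    # prints UTF-16BE
```

Once you have a guess, open the file with that explicit encoding, e.g. "<:encoding(UTF-16LE)". Text in other scripts can easily fool this, so treat the result as a hint, not an answer.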

--
David Serrano

Re^6: Unicode and text files by dirtdart (Beadle) on Oct 12, 2006 at 19:53 UTC
Thank you. That actually does help. It at least gives me a direction in which to look. I had previously been unaware of the BOM. That would also explain the odd characters that show up when I email myself a log file, but not when I open it in a text editor.