
I would have thought the following (quick hack) script would work:

use strict;
use warnings;

my ($inf,$outf)= @ARGV;
$inf or die "Must have a file to process\n" ;
$outf or $outf= $inf.".utf8";

open my $in, "<:encoding(utf16)", $inf
    or die "Can't open '$inf':$!";
open my $out, ">:utf8", $outf
    or die "Can't write '$outf':$!";

local $/; # slurp mode: read the whole file in one go

print {$out} <$in>        # decoded from UTF-16 on read, written out as UTF-8
    or die "Failed to convert file:$!";

close $in 
    or die "Something weird happened closing '$inf': $!";
close $out 
    or die "Failed to close '$outf', file is probably corrupted: $!";

Or even the more elegant one-liner:

perl -pe "BEGIN {binmode STDIN, ':encoding(utf16)'; binmode STDOUT, ':utf8'}"

But it doesn't work. My test input contains three Ĕ characters (U+0114), saved as UTF-16 by UltraEdit on Win2k, which gives a file with the octets FF FE 14 01 14 01 14 01. After conversion the output file has the octets EF BB BF C2 BE 00 14 00 01 00 14 00 01 00 14 00 01, which is just wrong. Can anybody spot what the problem is, or is Perl's UTF-16 support borked?
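For reference, here is a minimal sketch of what a correct conversion of those eight input octets should produce. It uses the core Encode module directly instead of PerlIO layers, and hex_octets is just a helper name made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The eight octets reported for the input file: a little-endian BOM (FF FE)
# followed by U+0114 three times.
my $utf16_octets = "\xFF\xFE\x14\x01\x14\x01\x14\x01";

# Hypothetical helper: format a byte string as space-separated hex pairs.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# "UTF-16" without an LE/BE suffix makes Encode read and strip the BOM.
my $text        = decode('UTF-16', $utf16_octets);
my $utf8_octets = encode('UTF-8', $text);

print hex_octets($utf8_octets), "\n";   # prints: C4 94 C4 94 C4 94
```

U+0114 needs two octets in UTF-8 (C4 94), so a correct output file for this input should be six octets long, with no BOM unless something adds one explicitly.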

Note that this was with Perl 5.8.6 from ActiveState.

Update: Turns out that this was all down to a display bug in UltraEdit. Thanks for the help, and sorry for wasting anybody's time.
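One way to take the editor out of the picture when checking a result like this is to dump the file's octets from Perl itself. A minimal sketch, assuming the files are small enough to slurp; file_octets_hex is a made-up helper name, not something from the original scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw octets of a file as hex pairs.
sub file_octets_hex {
    my ($path) = @_;
    # :raw disables CRLF translation and any encoding layer,
    # so we see exactly what is on disk.
    open my $fh, '<:raw', $path
        or die "Can't open '$path': $!";
    local $/;                            # slurp the whole file
    my $bytes = <$fh>;
    close $fh;
    $bytes = '' unless defined $bytes;   # guard against an empty read
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Dump every file named on the command line, one line per file.
print "$_: ", file_octets_hex($_), "\n" for @ARGV;
```

Saved under any name and run with the input and output files as arguments, it prints the octets as Perl reads them, independent of how an editor chooses to render them.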

---
$world=~s/war/peace/g

In reply to Converting UTF-16 files to UTF-8 by demerphq
