Re: Converting UTF-16 files to UTF-8

... I use an input file with a few (three) Ĕ in it (0x0114), saved in utf-16 by Ultraedit on win2k I end up with a file with the octets FF FE 01 14 01 14 01 14 ...

Um... If you're using ActiveState on win2k, and you have actually shown those 8 octets in their true "logical" (file sequential) order, then I'm puzzled about the data you have created using "Ultraedit".

The Byte Order Mark (BOM, \x{FEFF}) appears to be written in little-endian order (as we would expect for wintel), but if the next three byte pairs (six bytes) are supposed to be interpreted as "\x{0114}", they would have to be treated as big-endian.
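To make the mismatch concrete, here is a minimal sketch that decodes the six reported data bytes under each byte order. It assumes the core Encode module; the variable names and the code_points helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six data bytes exactly as reported, after the FF FE byte order mark.
my $data = "\x01\x14\x01\x14\x01\x14";

# Hypothetical helper: render a decoded string as code points, e.g. "U+0114 U+0114".
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

# Decode the same bytes under each byte order; the -BE/-LE names expect no BOM.
my $as_be = code_points( decode( 'UTF-16BE', $data ) );
my $as_le = code_points( decode( 'UTF-16LE', $data ) );

print "big-endian:    $as_be\n";    # U+0114 U+0114 U+0114
print "little-endian: $as_le\n";    # U+1401 U+1401 U+1401
```

Only the big-endian reading gives U+0114, which is at odds with the little-endian BOM in front of it.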

What's up with that? I'm as mystified as you as to why your initial output has all those null bytes, but it looks like a case of "garbage in, garbage out". Try using perl to generate your test data instead:

perl -e 'binmode STDOUT,":encoding(utf16)"; print "\x{0114}\n"x3'

Redirect that to a file, or pipe it directly to your elegant one-liner, and see if that gives you better results.

(update: My "data generator" one-liner was done on unix; for mswin, you need to change single-quotes to doubles and vice-versa... but then the "\x{0114}" thing breaks. Oh well -- use a bash shell or put the script in a file.)
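For the "put the script in a file" route, something along these lines should do. It is a sketch rather than a drop-in copy of the one-liner: it calls Encode's encode() and writes the bytes to a binary-mode STDOUT instead of pushing an :encoding layer, and the variable names are only illustrative. As far as I know, Encode's plain "UTF-16" writes a BOM followed by big-endian data, so ask for "UTF-16LE" and prepend the BOM yourself if you want to mimic a wintel file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three U+0114 characters, each followed by a newline.
my $text = "\x{0114}\n" x 3;

# Plain "UTF-16" on encode: BOM (FE FF) followed by big-endian data.
my $bytes = encode( 'UTF-16', $text );

# Write the bytes untouched, with no newline translation on Windows.
binmode STDOUT;
print $bytes;
```

Save it as, say, gen.pl (a hypothetical name) and run "perl gen.pl > in"; no shell quoting is involved, so it behaves the same on unix and mswin.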


Re^2: Converting UTF-16 files to UTF-8 by demerphq (Chancellor) on May 16, 2007 at 22:54 UTC
"then I'm puzzled about the data you have created"

Well, it seems you have a very good eye. :-) That was a typo on my behalf; it is actually FF FE 14 01 14 01 14 01 ... I'll update my original node.

--- $world=~s/war/peace/g
Re^3: Converting UTF-16 files to UTF-8 by ikegami (Patriarch) on May 17, 2007 at 15:31 UTC
Even with the change, I still don't get your result with ActivePerl 5.8.8 build 820:

    >debug in
    File not found
    -e100 FF FE 14 01 14 01 14 01
    -rcx
    CX 0000
    :8
    -w
    Writing 00008 bytes
    -q

    >perl 615796.pl in out

    >debug out
    -rcx
    CX 0006
    :
    -d100 l6
    137A:0100  C4 94 C4 94 C4 94
    -q
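The same check can be done without debug. This is only a sketch using the core Encode module, not the contents of 615796.pl (which aren't shown here), and the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The corrected input: little-endian BOM followed by three U+0114.
my $in = "\xFF\xFE\x14\x01\x14\x01\x14\x01";

# Plain "UTF-16" on decode reads the BOM to pick the byte order, then drops it.
my $text = decode( 'UTF-16', $in );

# Re-encode as UTF-8 and show the result as hex octets.
my $out = encode( 'UTF-8', $text );
my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;

print "$hex\n";    # C4 94 C4 94 C4 94
```

That gives the same six octets as the debug dump of the output file.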
Re^4: Converting UTF-16 files to UTF-8 by demerphq (Chancellor) on May 17, 2007 at 18:10 UTC
There was some bug in Ultraedit that led to my confusion. When I used debug I got the correct response, and when I upgraded UE the bogus error went away. So it was just a phantom bug and not a perl problem at all. Thanks v. much for your help.

--- $world=~s/war/peace/g