in reply to Converting UTF-16 files to UTF-8

... I use an input file with a few (three) Ĕ in it (0x0114), saved in utf-16 by Ultraedit on win2k I end up with a file with the octets FF FE 01 14 01 14 01 14 ...

Um... If you're using ActiveState on win2k, and you have actually shown those 8 octets in their true "logical" (file sequential) order, then I'm puzzled about the data you have created using "Ultraedit".

The Byte Order Mark (BOM, \x{FEFF}) appears to be written in little-endian order (as we would expect for wintel), but if the next three byte pairs (six bytes) are supposed to be interpreted as "\x{0114}", they would have to be treated as big-endian.

What's up with that? I'm as mystified as you as to why your initial output has all those null bytes, but it looks like a case of "garbage in, garbage out". Try using perl to generate your test data instead:

perl -e 'binmode STDOUT,":encoding(utf16)"; print "\x{0114}\n"x3'
Redirect that to a file, or pipe it directly to your elegant one-liner, and see if that gives you better results.

(update: My "data generator" one-liner was done on unix; for mswin, you need to change single-quotes to doubles and vice-versa... but then the "\x{0114}" thing breaks. Oh well -- use a bash shell or put the script in a file.)
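
To make the byte-order point concrete, here is a minimal sketch (assuming the core Encode module; the helper name codepoints_utf16 is made up for this illustration) that decodes both byte sequences as BOM-prefixed UTF-16 and prints the code points that come out:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode BOM-prefixed UTF-16 octets and return
# each resulting code point as a four-digit hex string.
sub codepoints_utf16 {
    my ($octets) = @_;
    my $text = decode('UTF-16', $octets);   # the BOM selects the byte order
    return map { sprintf('%04X', ord($_)) } split //, $text;
}

my $as_posted = "\xFF\xFE\x01\x14\x01\x14\x01\x14";   # octets as quoted above
my $le_0114   = "\xFF\xFE\x14\x01\x14\x01\x14\x01";   # little-endian U+0114 x 3

print join(' ', codepoints_utf16($as_posted)), "\n";   # 1401 1401 1401
print join(' ', codepoints_utf16($le_0114)),   "\n";   # 0114 0114 0114
```

With an FF FE BOM, the pair 01 14 decodes as U+1401 rather than U+0114, which is why the octets as posted looked wrong.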

Re^2: Converting UTF-16 files to UTF-8
by demerphq (Chancellor) on May 16, 2007 at 22:54 UTC

    then I'm puzzled about the data you have created

    Well, it seems you have a very good eye. :-) That was a typo on my part; it is actually FF FE 14 01 14 01 14 01 ...

    I'll update my original node.

    ---
    $world=~s/war/peace/g

      Even with the change, I still don't get your result with ActivePerl 5.8.8 build 820
      >debug in
      File not found
      -e100 FF FE 14 01 14 01 14 01
      -rcx
      CX 0000
      :8
      -w
      Writing 00008 bytes
      -q

      >perl 615796.pl in out

      >debug out
      -rcx
      CX 0006
      :
      -d100 l6
      137A:0100 C4 94 C4 94 C4 94
      -q
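
For anyone without debug at hand, the same check can be sketched in Perl alone. This is only an illustration using the core Encode module, not necessarily what 615796.pl does; the helper names utf16_to_utf8 and hexdump are invented here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert BOM-prefixed UTF-16 octets to UTF-8 octets.
sub utf16_to_utf8 {
    my ($octets) = @_;
    return encode('UTF-8', decode('UTF-16', $octets));
}

# Hypothetical helper: render octets as space-separated uppercase hex.
sub hexdump {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

my $in = "\xFF\xFE\x14\x01\x14\x01\x14\x01";   # the 8 octets entered with debug
print hexdump(utf16_to_utf8($in)), "\n";        # C4 94 C4 94 C4 94
```

The BOM is consumed by the decode step, so the output is six octets, matching the CX 0006 length and the dump above.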

        There was some bug in Ultraedit that led to my confusion. When I used debug I got the correct response, and when I upgraded UE the bogus error went away. So it was just a phantom bug and not a perl problem at all. Thanks very much for your help.

        ---
        $world=~s/war/peace/g