in reply to Converting UTF-16 files to UTF-8
... I use an input file with a few (three) Ĕ in it (0x0114), saved in utf-16 by Ultraedit on win2k I end up with a file with the octets FF FE 01 14 01 14 01 14 ...
Um... If you're using ActiveState on win2k, and you have actually shown those 8 octets in their true "logical" (file sequential) order, then I'm puzzled about the data you have created using "Ultraedit".
The Byte Order Mark (BOM, \x{FEFF}) appears to be written in little-endian order (as we would expect for wintel), but if the next three byte pairs (six octets) are supposed to be interpreted as "\x{0114}", they would have to be treated as big-endian.
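To make the mismatch concrete, here is a minimal sketch that decodes the six octets after the BOM under each byte order. It assumes the core Encode module; the helper name code_points is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The eight octets quoted above, in file order.
my $octets = "\xFF\xFE\x01\x14\x01\x14\x01\x14";

# Drop the two-octet BOM and keep the six data octets.
my $payload = substr($octets, 2);

# Hypothetical helper: decode $bytes as $encoding and list the code points.
sub code_points {
    my ($encoding, $bytes) = @_;
    return map { sprintf("U+%04X", ord($_)) } split(//, decode($encoding, $bytes));
}

print "as UTF-16LE: ", join(" ", code_points('UTF-16LE', $payload)), "\n";
print "as UTF-16BE: ", join(" ", code_points('UTF-16BE', $payload)), "\n";
```

Read as little-endian (what the FF FE mark announces), each pair comes out as U+1401; only the big-endian reading yields U+0114.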
What's up with that? I'm as mystified as you as to why your initial output has all those null bytes, but it looks like a case of "garbage in, garbage out". Try using perl to generate your test data instead:
perl -e 'binmode STDOUT,":encoding(utf16)"; print "\x{0114}\n"x3'
Redirect that to a file, or pipe it directly to your elegant one-liner, and see if that gives you better results.
(update: My "data generator" one-liner was done on unix; for mswin, you need to change single-quotes to doubles and vice-versa... but then the "\x{0114}" thing breaks. Oh well -- use a bash shell or put the script in a file.)
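For the "put the script in a file" route, something along these lines should sidestep the shell quoting problem. It is only a sketch: it calls Encode::encode directly rather than using the :encoding layer from the one-liner, and the file names in the usage line are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three U+0114 characters, one per line, as in the one-liner above.
my $text = "\x{0114}\n" x 3;

# Plain "UTF-16" (no LE/BE suffix) makes Encode prepend a BOM
# and write the data in big-endian order.
my $bytes = encode('UTF-16', $text);

# Write raw octets so no I/O layer (such as CRLF translation on
# Windows) alters the encoded data on the way out.
binmode STDOUT, ':raw';
print $bytes;
```

Saved as, say, gen_utf16.pl, it could be run as "perl gen_utf16.pl > test.utf16" from either a unix shell or cmd.exe, since nothing needs quoting on the command line.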
Re^2: Converting UTF-16 files to UTF-8
by demerphq (Chancellor) on May 16, 2007 at 22:54 UTC
by ikegami (Patriarch) on May 17, 2007 at 15:31 UTC
by demerphq (Chancellor) on May 17, 2007 at 18:10 UTC