uva has asked for the wisdom of the Perl Monks concerning the following question:
Replies are listed 'Best First'.
Re: help needed in utf16
by Corion (Patriarch) on Mar 27, 2006 at 13:53 UTC
Did you try a Google search? It pointed me to this mail, which indicates that you can open a file as UTF16 like this:
I've never used any of this myself, but I'm working from this line in the mail:
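As a minimal sketch of what opening a file as UTF-16 can look like with a PerlIO encoding layer (not necessarily the snippet from the mail): the helper name `read_utf16_lines` and the temporary sample file are assumptions made only so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read all lines of a UTF-16 file that starts with a BOM.
sub read_utf16_lines {
    my ($path) = @_;
    # The generic UTF-16 layer relies on the BOM to pick the byte order.
    open(my $in, '<:encoding(UTF-16)', $path)
        or die "Cannot open $path: $!";
    my @lines = <$in>;
    close($in);
    chomp @lines;
    return @lines;
}

# Build a small sample file so the sketch is self-contained.
# Writing through the same layer puts a BOM at the start of the file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);
open(my $out, '>:encoding(UTF-16)', $tmp_path)
    or die "Cannot write $tmp_path: $!";
print {$out} "first line\nsecond line\n";
close($out);

my @lines = read_utf16_lines($tmp_path);
print "$_\n" for @lines;
```

The three-argument form of open with a lexical filehandle keeps the encoding decision next to the open call, so every later read on that handle returns decoded characters.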
Re: help needed in utf16
by graff (Chancellor) on Mar 27, 2006 at 23:36 UTC
Update: after playing around with UTF-16 in Perl, I have rearranged and modified the information below, and added some one-liners that I found instructive. If the data file comes from a "well-behaved" application, the first character will be a byte-order mark (BOM, "\x{feff}"), and using ":encoding(UTF-16)" on the file handle will always do the right thing. But if you use "UTF-16" and there is no BOM in the data, perl will complain, as shown below. (The quoting in the following one-liner examples assumes a "bash"-style shell, and I use Unix "od" to view hex and character dumps of the output.)
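The BOM behaviour described above can also be seen without any file, using the core Encode module directly. This is a sketch rather than the original one-liners; the sample strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "hi" as big-endian UTF-16 bytes, with and without a leading BOM (FE FF).
my $with_bom    = "\xFE\xFF\x00h\x00i";
my $without_bom = "\x00h\x00i";

# With a BOM, the generic "UTF-16" decoder works out the byte order itself
# and drops the BOM from the result.
my $text = decode('UTF-16', $with_bom);

# Without a BOM, the generic decoder dies, so trap the error with eval.
my $ok     = eval { decode('UTF-16', $without_bom); 1 };
my $error  = $ok ? '' : $@;
my $failed = $ok ? 0 : 1;

# Naming the byte order explicitly removes the need for a BOM.
my $explicit = decode('UTF-16BE', $without_bom);

print "with BOM, generic decoder: $text\n";
print "without BOM, generic decoder failed: $error" if $failed;
print "without BOM, UTF-16BE decoder: $explicit\n";
```

The same rule applies to file handles: ":encoding(UTF-16)" needs the BOM, while ":encoding(UTF-16BE)" and ":encoding(UTF-16LE)" do not.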
So: with UTF-16 data that has no BOM, it is
Any ASCII characters in your data (e.g. spaces, tabs, carriage returns, line-feeds, alphanumerics, etc) will have a null byte as the "high byte" of the 16-bit character value; if the null byte shows up at an even-numbered byte offset (where the first byte of the file is at offset "0"), the data is "big-endian", and
On the other hand, if the null bytes show up at odd byte offsets, the data are little-endian, so
There are CPAN modules for the BOM, but you can also check it yourself:
(This sample code has been heavily updated relative to the initial posting, to include a usage statement, handling of an appropriate command-line option for byte order, proper use of "pack" to test for the BOM value, and proper handling when the BOM is present or absent.)
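A minimal sketch of both checks described above, the "pack" test for the BOM and the null-byte offset heuristic, might look like this. It is not the script from the original post; the helper name `guess_utf16_layer` and the sample byte strings are hypothetical.

```perl
use strict;
use warnings;

# The BOM U+FEFF as it appears on disk in each byte order.
my $BOM_BE = pack('n', 0xFEFF);    # bytes FE FF (big-endian 16-bit)
my $BOM_LE = pack('v', 0xFEFF);    # bytes FF FE (little-endian 16-bit)

# Hypothetical helper: suggest an encoding name for a chunk of raw bytes.
sub guess_utf16_layer {
    my ($bytes) = @_;

    # A BOM settles the question; the generic decoder can handle the rest.
    my $head = substr($bytes, 0, 2);
    return 'UTF-16' if $head eq $BOM_BE || $head eq $BOM_LE;

    # No BOM: count null bytes at even and odd offsets (first byte is offset 0).
    my ($even, $odd) = (0, 0);
    for my $i (0 .. length($bytes) - 1) {
        next unless substr($bytes, $i, 1) eq "\0";
        if ($i % 2) { $odd++ } else { $even++ }
    }
    return 'UTF-16BE' if $even > $odd;    # null high byte comes first
    return 'UTF-16LE' if $odd > $even;    # null high byte comes second
    return;                               # no basis for a guess
}

for my $sample ("\xFF\xFEa\x00", "\x00a\x00b", "a\x00b\x00") {
    my $guess = guess_utf16_layer($sample);
    printf "%s => %s\n", unpack('H*', $sample), defined $guess ? $guess : 'unknown';
}
```

The heuristic only works when the data contains ASCII-range characters, so treating an explicit command-line option for byte order as authoritative, as the post describes, is the safer design.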
Re: help needed in utf16
by SamCG (Hermit) on Mar 27, 2006 at 18:55 UTC
I've successfully used:
though I'd also point out that using bare filehandles is no longer recommended. However, I'm not sure whether the missing binmode call is actually your problem, since you don't describe your error. What I saw before using binmode was that the file would appear broken -- instead of "this\tis\tthe\tfile" (in a tab-delimited file), I'd see "t h i s i s t h e f i l e" (or something close to that).
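The symptom and the binmode fix can be reproduced with a short sketch. This is not the snippet from the reply: the temporary file, the choice of UTF-16LE without a BOM, and the variable names are all assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Sample data: one tab-delimited line stored as UTF-16LE without a BOM.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} encode('UTF-16LE', "this\tis\tthe\tfile\n");
close($tmp_fh);

# Read 1: no decoding, so every character arrives with a null byte attached.
open(my $raw, '<:raw', $tmp_path) or die "Cannot open $tmp_path: $!";
my $undecoded = do { local $/; <$raw> };
close($raw);
(my $shown = $undecoded) =~ tr/\0/ /;    # show the nulls as spaces
print "undecoded: $shown\n";

# Read 2: lexical filehandle plus binmode with an explicit byte order.
open(my $in, '<:raw', $tmp_path) or die "Cannot open $tmp_path: $!";
binmode($in, ':encoding(UTF-16LE)');
my $decoded = <$in>;
close($in);
chomp $decoded;

my @fields = split /\t/, $decoded;
print "decoded fields: @fields\n";
```

If the file did start with a BOM, ":encoding(UTF-16)" could be used in the binmode call instead of naming the byte order.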