Ah, the confusions surrounding Unicode. For something given a name that means 'one code' there sure are a lot of different ways to specify it...
UTF-16 is not a 'larger character set' than UTF-8.
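UTF-16 is an 'encoding', a method of storing characters in memory; it encodes the vast majority of characters in everyday use (everything in the Basic Multilingual Plane) as a single 16-bit unit, and the remaining ones as a pair of 16-bit units (a 'surrogate pair'). Windows NT Unicode strings are UTF-16 encoded, in little-endian byte order.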
UTF-8 is another encoding, and the one Perl uses internally. It encodes all of the original 7-bit ASCII characters as a single byte, identically to the way they are encoded in ASCII and in the Windows 'ANSI' code pages. Every other character takes two to four bytes.
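To make the size difference concrete, here is a minimal sketch that encodes a few code points both ways and reports the byte counts. The helper name encoded_sizes and the sample code points are my own picks for illustration, not anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns [bytes in UTF-8, bytes in UTF-16BE]
# for a single code point.
sub encoded_sizes {
    my ($code_point) = @_;
    my $char = chr($code_point);
    return [
        length( encode( 'UTF-8',    $char ) ),
        length( encode( 'UTF-16BE', $char ) ),
    ];
}

# 'A', e-acute, the euro sign, and one character outside the BMP
for my $code_point ( 0x41, 0xE9, 0x20AC, 0x1F600 ) {
    my ( $utf8_bytes, $utf16_bytes ) = @{ encoded_sizes($code_point) };
    printf "U+%04X: %d byte(s) in UTF-8, %d in UTF-16BE\n",
        $code_point, $utf8_bytes, $utf16_bytes;
}
```

The ASCII letter is smaller in UTF-8, the euro sign is smaller in UTF-16, and the last one costs four bytes either way, so neither encoding is "larger" than the other.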
If you have an application that's expecting UTF-16, you'll want to use the Encode module (which I believe is core, in 5.8 at least) to turn your string into one that Perl will emit as UTF-16:
use Encode;
my ($unicode_string, $utf16_string);
$unicode_string = get_a_unicode_string();
# ^^ this string is a character string internally stored
# as UTF-8
$utf16_string = encode('utf16', $unicode_string);
# ^^ this string is an 'octet' (byte) string internally
# stored as bytes. Each character of $unicode_string is stored
# in two bytes of $utf16_string (four bytes for characters
# outside the Basic Multilingual Plane).
# (Also note the presence of a UTF-16 BOM at the start)
function_expecting_utf16($utf16_string);
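If you want to see what actually comes out of encode(), a quick hex dump helps. This is only a sketch: hex_bytes is a helper I made up for the demonstration, and whether your receiving application wants the BOM or a fixed byte order is something you would have to check against its documentation. Since Windows APIs work in little-endian order, 'UTF-16LE' (which emits no BOM) may be what such an application expects.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($octets) = @_;
    return uc( join( ' ', unpack( '(H2)*', $octets ) ) );
}

my $text = "Hi";

# Plain 'UTF-16' writes a BOM (FE FF) followed by big-endian units.
my $with_bom = encode( 'UTF-16', $text );
print hex_bytes($with_bom), "\n";                        # FE FF 00 48 00 69

# Naming the byte order explicitly suppresses the BOM.
print hex_bytes( encode( 'UTF-16LE', $text ) ), "\n";    # 48 00 69 00

# decode() goes the other way and consumes the BOM.
print "round trip ok\n" if decode( 'UTF-16', $with_bom ) eq $text;
```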
Update:
- Fixed missing parens around multivariable my().
- Added comment elaborating on $unicode_string.
(Thanks, ytsh)
--Stevie-O
$"=$,,$_=q>|\p4<6 8p<M/_|<('=>
.q>.<4-KI<l|2$<6%s!<qn#F<>;$,
.=pack'N*',"@{[unpack'C*',$_]
}"for split/</;$_=$,,y[A-Z a-z]
{}cd;print lc
In one sense, the difference between utf-16 on the one hand (either the "little-endian" or "big-endian" variety: UTF-16LE, UTF-16BE), and utf-8 on the other, is kind of like the difference between a raw binary file and a base64 or uuencoded version of that file. It's a matter of taking a stream of bits, breaking them into chunks, and adding a few bits to each chunk in a particular way, so that the result has certain desirable properties. Both utf-8 and utf-16 cover the same "value space", they simply express the values differently.
In the case of base64 or uuencode, the desired properties are that the result is a stream of printable ascii characters, suitable for transmission via email, etc. In the case of utf-8, the desired properties are:
- Characters that have been recognized as ascii since the invention of ascii are unmodified by the process -- they remain single-byte ascii characters, with their highest bit being clear. ASCII is really a subset of utf-8.
- Characters above the 7-bit ascii range (i.e. values higher than 0x7f), will be rendered as two or more bytes -- these are the "wide characters" -- and all bytes involved will have their highest bit set.
- For each wide character, the two highest bits are always "11" in the first byte of the sequence, and always "10" in each subsequent byte; actually, the number of high bits that are set in the first byte (before the first "0" bit) tells you how many bytes the whole sequence occupies, so "110" starts a two-byte character, "1110" a three-byte one, and "11110" a four-byte one.
- A variety of different algorithms will suffice to validate and interpret a utf-8 stream, and all of them should behave the same regardless of cpu type (big or little endian), because everything is done in terms of bytes.
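The bit patterns in the list above can be sketched as a small byte-walker. This is a rough illustration only, not how Perl itself does it: the two function names are mine, and it checks only the lead/continuation structure (it does not reject overlong forms or surrogates, which a real validator would).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: total length of the sequence that a lead byte
# announces, or 0 if the byte cannot start a sequence.
sub utf8_sequence_length {
    my ($byte) = @_;
    return 1 if ( $byte & 0x80 ) == 0x00;    # 0xxxxxxx: plain ASCII
    return 0 if ( $byte & 0xC0 ) == 0x80;    # 10xxxxxx: continuation byte
    return 2 if ( $byte & 0xE0 ) == 0xC0;    # 110xxxxx
    return 3 if ( $byte & 0xF0 ) == 0xE0;    # 1110xxxx
    return 4 if ( $byte & 0xF8 ) == 0xF0;    # 11110xxx
    return 0;                                # anything else is invalid
}

# Hypothetical helper: count characters in a UTF-8 byte string,
# returning undef if the byte structure is broken.
sub count_utf8_characters {
    my ($octets) = @_;
    my @bytes = unpack 'C*', $octets;
    my ( $pos, $count ) = ( 0, 0 );
    while ( $pos < @bytes ) {
        my $len = utf8_sequence_length( $bytes[$pos] );
        return if $len == 0 || $pos + $len > @bytes;
        for my $offset ( 1 .. $len - 1 ) {
            # every following byte must look like 10xxxxxx
            return unless ( $bytes[ $pos + $offset ] & 0xC0 ) == 0x80;
        }
        $pos += $len;
        $count++;
    }
    return $count;
}

# 'a', e-acute, euro sign: 1 + 2 + 3 bytes, but three characters
print count_utf8_characters( encode( 'UTF-8', "a\x{E9}\x{20AC}" ) ), "\n";
```

Nothing here depends on byte order, which is the point of the last item in the list.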
As mentioned previously, Perl 5.8 core does include support for all of the Unicode encodings; it uses utf-8 internally, but can read and write data as utf-16 (BE or LE, regardless of what machine you use), by using the "decode" and "encode" functions of Encode.pm, or by using the PerlIO support for character encodings (the ":encoding(...)" layer) -- you can open a utf-16 file for input or output as follows (not tested):
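# a fancy version of "byte-swapping", combined with "wc"
# (not suitable unless you know the input is UTF-16LE):
open( INP, "<:encoding(UTF-16LE)", "input.file" )
    or die "can't read input.file: $!";
open( OUT, ">:encoding(UTF-16BE)", "output.file" )
    or die "can't write output.file: $!";
my ( $lines, $words, $chars ) = ( 0, 0, 0 );
while (<INP>) {
    $lines++;
    my @fields = split;      # we're using utf-8 now...
    $words += @fields;       # array in numeric context = word count
    $chars += length($_);    # counts _characters_ -- NOT BYTES
    print OUT;
}
close(OUT) or die "can't finish writing output.file: $!";
printf( "%7d %7d %7d\n", $lines, $words, $chars );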
I've needed a simple script like this when porting certain text data from a wintel (LE) machine to any sort of big-endian box -- dependence on byte order is one of the down-sides to the 16-bit encodings of unicode, especially when there happens to be no byte-order-mark (BOM) at the start of the file...
(update: fixed the file-handle name in the while() statement, so it matched the file-handle name in the first open statement)
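When a BOM is present, you can sniff it before choosing a byte order. Here is a minimal sketch; the function name is hypothetical, and it only looks at the first two bytes, so it will not tell a UTF-16LE BOM apart from a UTF-32LE one (which starts with the same two bytes). As far as I know, Encode's plain 'UTF-16' decoder does this check itself and dies when no BOM is found, so this is mostly useful when you want to fall back to a default instead.

```perl
use strict;
use warnings;

# Hypothetical helper: given the first bytes of a file, return the
# encoding name the BOM implies, or nothing if there is no UTF-16 BOM.
sub utf16_encoding_from_bom {
    my ($head) = @_;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return;
}

for my $sample ( "\xFF\xFEH\x00", "\xFE\xFF\x00H", "H\x00i\x00" ) {
    my $encoding = utf16_encoding_from_bom($sample);
    print defined $encoding ? $encoding : 'no BOM, you have to guess', "\n";
}
```

In a real script you would read those first bytes from the file in binary mode (binmode or a ":raw" layer), then reopen or push the matching ":encoding(...)" layer.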
You seem to have a slight misunderstanding about Unicode and UTF. As I understand it, Unicode is basically a big table of characters (or something like characters). Each character is assigned a number. UTF-8 and UTF-16 are character encodings. They describe what number a certain set of bits represents. UTF-* is used to encode Unicode characters, but Unicode itself is not an encoding. As to your question, I am not sure how Perl handles other encodings besides UTF-8. Hopefully some other monk will explain it (or perhaps even explain where I might be misunderstanding things. This has been known to happen on occasion.)
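The "table of numbers" view is easy to check from Perl. In this small sketch (the variable names are just for illustration), the same character arrives as two different byte sequences, and after decoding both turn out to carry the same number from the Unicode table:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The euro sign, U+20AC, stored two different ways.
my $from_utf8  = decode( 'UTF-8',    "\xE2\x82\xAC" );    # three bytes
my $from_utf16 = decode( 'UTF-16BE', "\x20\xAC" );        # two bytes

printf "U+%04X U+%04X\n", ord($from_utf8), ord($from_utf16);
print "same character\n" if $from_utf8 eq $from_utf16;
```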