In one sense, the difference between utf-16 on the one hand (either the "little-endian" or "big-endian" variety: UTF-16LE, UTF-16BE), and utf-8 on the other, is kind of like the difference between a raw binary file and a base64 or uuencoded version of that file. It's a matter of taking a stream of bits, breaking them into chunks, and adding a few bits to each chunk in a particular way, so that the result has certain desirable properties. Both utf-8 and utf-16 cover the same "value space"; they simply express the values differently.
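To make the "same value space, different expression" point concrete, here is a small sketch (not from the original post) that encodes one code point three ways with Encode.pm. The helper name hex_bytes and the choice of U+20AC are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string in the named encoding
# and return the resulting bytes as space-separated hex.
sub hex_bytes {
    my ( $enc, $chars ) = @_;
    my $bytes = encode( $enc, $chars );
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# U+20AC EURO SIGN: one value, three byte-level spellings.
for my $enc (qw(UTF-16LE UTF-16BE UTF-8)) {
    printf "%-8s %s\n", $enc, hex_bytes( $enc, "\x{20AC}" );
}
# UTF-16LE AC 20
# UTF-16BE 20 AC
# UTF-8    E2 82 AC
```

The two utf-16 results are byte-swapped copies of each other, while the utf-8 result is the same 16 bits spread over three bytes with marker bits added.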
In the case of base64 or uuencode, the desired properties are that the result is a stream of printable ascii characters, suitable for transmission via email, etc. In the case of utf-8, the desired properties are:
- Characters that have been recognized as ascii since the invention of ascii are unmodified by the process -- they remain single-byte ascii characters, with their highest bit being clear. ASCII is really a subset of utf-8.
- Characters above the 7-bit ascii range (i.e. values higher than 0x7f), will be rendered as two or more bytes -- these are the "wide characters" -- and all bytes involved will have their highest bit set.
- For each wide character, the two highest bits are always "11" in the first byte of the sequence, and always "10" in each subsequent byte; actually, the number of high bits that are set in the first byte gives the total length of the sequence in bytes for the current wide character (so "110" starts a two-byte sequence, "1110" a three-byte sequence, and so on).
- A variety of different algorithms will suffice to validate and interpret a utf-8 stream, and all of them should behave the same regardless of cpu type (big or little endian), because everything is done in terms of bytes.
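The bit patterns in the list above can be checked byte by byte. The following is a minimal sketch, with a made-up helper name (utf8_byte_kind); it only looks at the high bits, so it is not a full validator (it does not reject overlong forms or values beyond the unicode range).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: classify one byte value (0..255) by its high bits.
sub utf8_byte_kind {
    my ($byte) = @_;
    return 'ascii'        if ( $byte & 0x80 ) == 0x00;    # 0xxxxxxx
    return 'continuation' if ( $byte & 0xC0 ) == 0x80;    # 10xxxxxx
    return 'lead-2'       if ( $byte & 0xE0 ) == 0xC0;    # 110xxxxx
    return 'lead-3'       if ( $byte & 0xF0 ) == 0xE0;    # 1110xxxx
    return 'lead-4'       if ( $byte & 0xF8 ) == 0xF0;    # 11110xxx
    return 'invalid';
}

# "A", U+00E9 (e with acute accent) and U+20AC (euro sign), as utf-8 bytes
for my $byte ( unpack 'C*', encode( 'UTF-8', "A\x{E9}\x{20AC}" ) ) {
    printf "%08b  %s\n", $byte, utf8_byte_kind($byte);
}
```

Because the test is done one byte at a time with masks, it gives the same answer on big-endian and little-endian machines.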
As mentioned previously, Perl 5.8 core does include support for all versions of unicode; it uses utf-8 internally, but can read and write data as utf-16 (BE or LE, regardless of what machine you use), by using the "decode" and "encode" functions of Encode.pm, or by using the PerlIO support for character encodings -- you can open a utf-16 file for input or output as follows (not tested):
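# a fancy version of "byte-swapping", combined with "wc"
# (not suitable unless you know the input is UTF-16LE):
open( INP, "<:encoding(UTF-16LE)", "input.file" )  or die "input.file: $!";
open( OUT, ">:encoding(UTF-16BE)", "output.file" ) or die "output.file: $!";
my ( $lines, $words, $chars ) = ( 0, 0, 0 );
while (<INP>) {
    $lines++;
    my @fields = split;     # whitespace split; we're using (perl-internal) utf-8 now...
    $words += @fields;      # an array in numeric context gives its element count
    $chars += length();     # counts _characters_ -- NOT BYTES
    print OUT;
}
close(OUT) or die "output.file: $!";
printf( "%7d %7d %7d\n", $lines, $words, $chars );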
I've needed a simple script like this when porting certain text data from a wintel (LE) machine to any sort of big-endian box -- cpu dependence is one of the down-sides to the 16-bit form of unicode, especially when there happens to be no byte-order-mark (BOM) at the start of the file...
(update: fixed the file-handle name in the while() statement, so it matched the file-handle name in the first open statement)
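When you can't be sure whether a BOM is present, one option is to peek at the first two bytes and choose the encoding layer from that. This is only a sketch, not tested against real data: the helper name guess_utf16 and the fallback to UTF-16LE are assumptions for illustration, and it makes no attempt to tell a UTF-16LE BOM apart from the start of a UTF-32LE one.

```perl
use strict;
use warnings;

# Hypothetical helper: given the first bytes of a file and a fallback
# encoding name, return (encoding name, number of BOM bytes to skip).
sub guess_utf16 {
    my ( $head, $default ) = @_;
    return ( 'UTF-16BE', 2 ) if $head =~ /\A\xFE\xFF/;
    return ( 'UTF-16LE', 2 ) if $head =~ /\A\xFF\xFE/;
    return ( $default, 0 );    # no BOM: we have to be told, or guess
}

if (@ARGV) {
    my $file = shift;
    open( my $in, '<:raw', $file ) or die "$file: $!";
    read( $in, my $head, 2 );                      # peek at the raw bytes
    my ( $enc, $skip ) = guess_utf16( $head, 'UTF-16LE' );
    seek( $in, $skip, 0 ) or die "$file: $!";      # step past the BOM, if any
    binmode( $in, ":encoding($enc)" );             # decode from here on

    my $chars = 0;
    $chars += length while <$in>;
    print "$file: $enc, $chars characters\n";
}
```

Run it with a file name as the only argument; with no arguments it just defines the helper and does nothing.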