jfroebe has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

In "umlauts, special chars in perl regular expressions" by Wouldbewarrior, he asks about support of unicode-8. I did a quick Google search for unicode-16 (UTF-16) support in Perl but only found references to UTF-8.

Does anyone know of support for the larger character set?

thanks

Jason L. Froebe

No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

janitored by ybiC: Prepend node title with "Does Perl support ". One-word node titles be bad, very bad.

Re: Does Perl support unicode-16?
by Stevie-O (Friar) on Apr 21, 2004 at 22:53 UTC
    Ah, the confusions surrounding Unicode. For something given a name that means 'one code' there sure are a lot of different ways to specify it...

    UTF-16 is not a 'larger character set' than UTF-8.

    UTF-16 is an 'encoding', a method of storing characters as bytes; it encodes most characters (everything in the Basic Multilingual Plane) in a single 16-bit unit, and the remaining ones as a pair of 16-bit units. Windows NT Unicode strings are UTF-16 encoded.

    UTF-8 is another encoding, and the one Perl uses internally. It encodes all of the original 7-bit ASCII characters as a single byte, identically to the way they are encoded in plain ASCII (and in the Windows "ANSI" code pages); every other character takes two to four bytes.
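
    To see the size difference between the two encodings, here is a minimal sketch using the core Encode module. The helper name encoded_length and the four sample code points are chosen purely for illustration; they are not from the original post.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Return the number of bytes one character occupies in a given encoding.
    # (Hypothetical helper, named here for illustration only.)
    sub encoded_length {
        my ($encoding, $char) = @_;
        return length encode($encoding, $char);
    }

    # Sample code points: ASCII, Latin-1 range, elsewhere in the BMP, beyond the BMP.
    for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
        my $char = chr $cp;
        printf "U+%04X  UTF-8: %d byte(s)  UTF-16BE: %d byte(s)\n",
            $cp,
            encoded_length('UTF-8',    $char),
            encoded_length('UTF-16BE', $char);
    }
    ```

    Neither encoding is "larger" than the other: both can represent every code point, they just spend their bytes differently.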

    If you have an application that's expecting UTF-16, you'll want to use the Encode module (which I believe is core, in 5.8 at least) to turn your string into one that Perl will emit as UTF-16:

        use Encode;

        my ($unicode_string, $utf16_string);

        $unicode_string = get_a_unicode_string();
        # ^^ this string is a character string internally stored
        # as UTF-8

        $utf16_string = encode('UTF-16', $unicode_string);
        # ^^ this string is an 'octet' (byte) string internally
        # stored as bytes. Each character of the string is stored in
        # two bytes of $utf16_string (four bytes for characters
        # outside the Basic Multilingual Plane).
        # (Also note the presence of a UTF-16 BOM)

        function_expecting_utf16($utf16_string);
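
    The BOM mentioned in the comments can be made visible with a short, self-contained sketch. The hex_bytes helper and the sample text are assumptions added for illustration; only encode and decode come from Encode itself.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Render a byte string as space-separated hex pairs for inspection.
    # (Hypothetical helper, not part of Encode.)
    sub hex_bytes {
        my ($bytes) = @_;
        return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    }

    my $text = "Hi";

    print hex_bytes(encode('UTF-16',   $text)), "\n";   # BOM first, then big-endian units
    print hex_bytes(encode('UTF-16BE', $text)), "\n";   # no BOM, big-endian
    print hex_bytes(encode('UTF-16LE', $text)), "\n";   # no BOM, little-endian

    # Decoding with plain 'UTF-16' reads the BOM and strips it again.
    my $round_trip = decode('UTF-16', encode('UTF-16', $text));
    print $round_trip eq $text ? "round trip ok\n" : "round trip failed\n";
    ```

    If the receiving application does not want a BOM, naming the byte order explicitly (UTF-16LE or UTF-16BE) is probably the safer choice.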

    Update:

    • Fixed missing parens around multivariable my().
    • Added comment elaborating on $unicode_string.
    (Thanks, ytsh)
    --Stevie-O
    $"=$,,$_=q>|\p4<6 8p<M/_|<('=> .q>.<4-KI<l|2$<6%s!<qn#F<>;$, .=pack'N*',"@{[unpack'C*',$_] }"for split/</;$_=$,,y[A-Z a-z] {}cd;print lc
Re: Does Perl support unicode-16?
by jmcnamara (Monsignor) on Apr 21, 2004 at 22:57 UTC
Re: Does Perl support unicode-16?
by graff (Chancellor) on Apr 22, 2004 at 05:28 UTC
    In one sense, the difference between utf-16 on the one hand (either the "little-endian" or "big-endian" variety: UTF-16LE, UTF-16BE), and utf-8 on the other, is kind of like the difference between a raw binary file and a base64 or uuencoded version of that file. It's a matter of taking a stream of bits, breaking them into chunks, and adding a few bits to each chunk in a particular way, so that the result has certain desirable properties. Both utf-8 and utf-16 cover the same "value space", they simply express the values differently.

    In the case of base64 or uuencode, the desired properties are that the result is a stream of printable ascii characters, suitable for transmission via email, etc. In the case of utf-8, the desired properties are:

    • Characters that have been recognized as ascii since the invention of ascii are unmodified by the process -- they remain single-byte ascii characters, with their highest bit being clear. ASCII is really a subset of utf-8.
    • Characters above the 7-bit ascii range (i.e. values higher than 0x7f), will be rendered as two or more bytes -- these are the "wide characters" -- and all bytes involved will have their highest bit set.
    • For each wide character, the two highest bits are always "11" in the first byte of the sequence, and always "10" in each subsequent byte; actually, the number of high bits that are set in the first byte will indicate how many bytes will follow for the current wide character.
    • A variety of different algorithms will suffice to validate and interpret a utf-8 stream, and all of them should behave the same regardless of cpu type (big or little endian), because everything is done in terms of bytes.
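
    The bit layout described in the list above can be sketched by hand and checked against Encode. The function utf8_bytes_for is a hypothetical helper written for this illustration; it does no validation (surrogates and values above U+10FFFF are not rejected), so treat it as a demonstration rather than a replacement for Encode.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Build the UTF-8 byte sequence for one code point by hand.
    sub utf8_bytes_for {
        my ($cp) = @_;
        my @bytes;
        if ($cp < 0x80) {            # 0xxxxxxx
            @bytes = ($cp);
        }
        elsif ($cp < 0x800) {        # 110xxxxx 10xxxxxx
            @bytes = (0xC0 | ($cp >> 6),
                      0x80 | ($cp & 0x3F));
        }
        elsif ($cp < 0x10000) {      # 1110xxxx 10xxxxxx 10xxxxxx
            @bytes = (0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
        }
        else {                       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            @bytes = (0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
        }
        return pack 'C*', @bytes;
    }

    # Show the bit patterns and compare each result with Encode's output.
    for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
        my $by_hand   = utf8_bytes_for($cp);
        my $by_encode = encode('UTF-8', chr $cp);
        printf "U+%04X  %-35s  %s\n", $cp,
            join(' ', map { sprintf '%08b', $_ } unpack 'C*', $by_hand),
            $by_hand eq $by_encode ? 'matches Encode' : 'differs from Encode';
    }
    ```

    In the printed bit patterns, the count of leading 1 bits in the first byte of each multi-byte sequence equals the total number of bytes in that sequence, and every continuation byte starts with "10".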

    As mentioned previously, Perl 5.8 core does include support for the various Unicode encodings; it uses utf-8 internally, but can read and write data as utf-16 (BE or LE, regardless of what machine you use), by using the "decode" and "encode" functions of Encode.pm, or by using the PerlIO support for character encodings -- you can open a utf-16 file for input or output as follows (not tested):

        # a fancy version of "byte-swapping", combined with "wc"
        # (not suitable unless you know the input is UTF-16LE):

        open( INP, "<:encoding(UTF-16LE)", "input.file" )  or die "input.file: $!";
        open( OUT, ">:encoding(UTF-16BE)", "output.file" ) or die "output.file: $!";
        my ( $lines, $words, $chars ) = ( 0, 0, 0 );
        while (<INP>) {
            $lines++;
            my @fields = split;    # we're using utf-8 internally now...
            $words += @fields;
            $chars += length;      # counts _characters_ -- NOT BYTES
            print OUT;
        }
        printf( "%7d %7d %7d\n", $lines, $words, $chars );

    I've needed a simple script like this when porting certain text data from a wintel (LE) machine to any sort of big-endian box -- byte-order dependence is one of the down-sides to the 16-bit forms of unicode, especially when there happens to be no byte-order-mark (BOM) at the start of the file...
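
    When a BOM might or might not be present, one possible approach is to look for it first and fall back to a caller-supplied byte order otherwise. This is only a sketch: decode_utf16_guess and its $fallback parameter are hypothetical names, and without a BOM there is no reliable way to know the byte order, so the fallback is a guess.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Decode a UTF-16 byte string, picking the byte order from a leading BOM.
    # $fallback names the encoding to assume when no BOM is found
    # (an assumption supplied by the caller, defaulting to UTF-16LE).
    sub decode_utf16_guess {
        my ($bytes, $fallback) = @_;
        $fallback = 'UTF-16LE' unless defined $fallback;

        # Strip the BOM if present and decode with the byte order it indicates.
        if    ($bytes =~ s/\A\xFF\xFE//) { return decode('UTF-16LE', $bytes) }
        elsif ($bytes =~ s/\A\xFE\xFF//) { return decode('UTF-16BE', $bytes) }

        return decode($fallback, $bytes);
    }

    print decode_utf16_guess("\xFF\xFEH\x00i\x00"), "\n";        # little-endian with BOM
    print decode_utf16_guess("\xFE\xFF\x00H\x00i"), "\n";        # big-endian with BOM
    print decode_utf16_guess("\x00H\x00i", 'UTF-16BE'), "\n";    # no BOM, caller names the order
    ```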

    (update: fixed the file-handle name in the while() statement, so it matched the file-handle name in the first open statement)

Re: Does Perl support unicode-16?
by revdiablo (Prior) on Apr 21, 2004 at 22:57 UTC

    You seem to have a slight misunderstanding about Unicode and UTF. As I understand it, Unicode is basically a big table of characters (or something like characters). Each character is assigned a number. UTF-8 and UTF-16 are character encodings. They describe what number a certain set of bits represents. UTF-* is used to encode Unicode characters, but Unicode itself is not an encoding. As to your question, I am not sure how Perl handles other encodings besides UTF-8. Hopefully some other monk will explain it (or perhaps even explain where I might be misunderstanding things. This has been known to happen on occasion.)
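
    The distinction between the character table and its encodings can be illustrated with a small sketch, assuming Perl 5.8 or later with the core charnames and Encode modules. The choice of U+00FC (an umlaut, in keeping with the node that prompted the question) and the variable names are illustrative only.

    ```perl
    use strict;
    use warnings;
    use charnames ();
    use Encode qw(encode decode);

    # One entry in the Unicode table: a number (code point) with a name.
    my $code_point = 0x00FC;
    my $char       = chr $code_point;
    printf "U+%04X is %s\n", $code_point, charnames::viacode($code_point);

    # Two encodings of that one character give different byte sequences...
    my $as_utf8  = encode('UTF-8',    $char);
    my $as_utf16 = encode('UTF-16BE', $char);
    printf "UTF-8:    %s\n", unpack('H*', $as_utf8);
    printf "UTF-16BE: %s\n", unpack('H*', $as_utf16);

    # ...but decoding either one leads back to the same code point.
    printf "decoded code points: U+%04X and U+%04X\n",
        ord(decode('UTF-8',    $as_utf8)),
        ord(decode('UTF-16BE', $as_utf16));
    ```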