Re^5: How is perl able to handle the null byte?

Is there no possibility that when encoding a string to one of the many forms of unicode for output to an external system that there might legitimately be null bytes embedded within the string?

The only forms of unicode that can involve a null byte as part of a non-null character are the 16-bit encodings: UTF-16LE and UTF-16BE.

"UTF-16" without the byte-order spec refers either to "whatever the native byte-order is on the current cpu" or to a data file encoded as 16-bit unicode characters and having a byte-order-mark (BOM, U+FEFF) as the very first character, so that unicode-aware readers know whether they need to swap bytes in order for their current cpu to see the intended 16-bit character values. Think of UTF-16* as if it were 16-bit PCM audio data: you need to handle it in two-byte chunks, and you need to know which of the two is the "least significant byte"; if you treat it as just bytes, anything can happen.

So it's a pretty nice feature that Perl uses utf-8 as its internal string representation, and not utf-16. This encoding is analogous to uuencoding or base64 encoding, though it's actually a bit more clever: the intention is to convey 16 bits worth of data using only a limited range of possible byte values, but the number of bytes needed to convey that value will tend to be fewer for the "simpler" characters (those in the lower range of the 16-bit space) than for the "heavier" characters (those in the higher range).

Because of the design, ASCII characters (00-7F) remain single-byte characters in utf-8; code points U+0080 through U+08FF need two bytes, and from U+0900 through U+FFFF you need three bytes. In the multi-byte "wide" characters, all bytes have their high bits set, so as not to be confusable with ASCII. (The "Unicode Encodings" section of the perlunicode man page provides all the details quite nicely.)

Of course, the whole notion of "wide characters" in C has the same status as the notion of "strings" -- i.e. it's a convenient fiction; that's why all the pre-unicode wide-character encodings (for Chinese, Japanese and Korean) never used a null byte as a component of a multi-byte character.

(<update> Regarding this question: I feel sure that some of the MS wide character sets contain some characters where one half of the 16-bit values can be null... -- Well, now that you mention it, I've looked at hex dumps of Word files containing unicode characters, and they actually alternate at block boundaries (2KB blocks, I think, but I forget) between single-byte character encoding for blocks that don't contain wide characters, vs. UTF-16LE encoding for blocks with wide characters in them. Pretty scary stuff -- I would call it brain-damaged. But none of the 2-byte "legacy" MS/DOS code pages (e.g. CP936) ever used null bytes as part of a code point. </update>)

I guess if people wanted to pursue the notion of granting special status to character strings in order to enable some sort of trap or check for "embedded-null-byte", there would have to be a flag on the SV that says "this is a character string (so if you see an embedded null byte, that would mean something is wrong)."

Since SV's are used to store all kinds of stuff, some of which is expected to include null bytes by nature, there would have to be something similiar to the utf8 flag, that says "this is really character data, and I'd be worried if there were a null byte in it". Then, every SV-to-char* operation would need to know whether the char* is going to be used as a character string in C, and if so, check that flag.

Comment on Re^5: How is perl able to handle the null byte?

Replies are listed 'Best First'.
Re^6: How is perl able to handle the null byte? by BrowserUk (Patriarch) on Jun 17, 2006 at 07:45 UTC
This is the passage from the ('a', possibly superceded), ANSI/ISO C spec that worries me: From Wikipedia (my emphasis). The Unicode standard 4.0 says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension." Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^7: How is perl able to handle the null byte? by graff (Chancellor) on Jun 18, 2006 at 00:42 UTC
That quotation comes from section 5.2 of the Unicode standard (in the current 4.1 version), and it is about the "wchar_t" data type in C. If/when the Perl-to-C interface includes conversions from (say) utf8 Perl strings to C wchar_t, this detail would need to be taken into account. I don't actually know whether XS covers conversion from SV to wchar_t at present, but I believe this is not relevant to the SV-to-char issue being discussed in this thread.	[reply]
Re^8: How is perl able to handle the null byte? by BrowserUk (Patriarch) on Jun 18, 2006 at 07:13 UTC
My point is that when you encode strings to formats other than Perl's internal utf8 format, you end up with a scalar without the utf flag set, in which multiple bytes are used to represent single characters. In some of these encodings, some individual bytes of the multi-byte characters can be null. Eg. As you pointed out the UTF-16 variants. These encoded values often need to be passed to system apis. At that point, warning or dieing because the scalar contains embedded nulls would be wrong and if Perl was to try and implement such warnings or traps for embedded null bytes, it would need to detect the above situations. Example. The often questioned, but (to my knowledge), yet to be resolved problem of globbing the windows filesystem for files with unicode filenames. Win32 has (for a long time now), a very functional set of APIs for dealing with Unicode (wide) filenames. In order to use them, it is necessary to convert the path and/or wildcard inputs to the FindFirstFileW() call, to Window's internal Unicode representation, UTF-16(LE). This can be done using Encode::encode(). Having used this call (say) `## Why oh why does it try to modify the input!? my $uPath = encode( 'UTF-16LE', $_ = "\\\\?\\c:\\some\\path", 1 );` [download] The scalar `$uPath` will contain the UTF-16LE representation of the input. `use Encode; print encode( 'UTF-16LE', $_ = "\\\\?\\C:\\some\\path", 1 );; \ \ ? \ C : \ s o m e \ p a t h $uPath = encode( 'UTF-16LE', $_ = "\\\\?\\C:\\some\\path", 1 );; print unpack 'H', $uPath;; 5c005c003f005c0043003a005c0073006f006d0065005c007000610074006800` [download] Now I need to pass this as the first parameter to FindFirstFileW()--which is defined as `HANDLE FindFirstFileW( LPCTSTR lpFileName, LPWIN32_FIND_DATA lpFindFil +eData ) LPTCSTR => long pointer to (array of) TCHAR TCHAR => wchar_t` [download] either directly through Win32::API, or indirectly through glob or opendir; but if embedded null detection was implemented, it would warn or die because every other byte of the PV contains a null--but that is intentional and correct*. So, not only would Perl need to detect embedded nulls, it would also need to detect cases where embedded nulls are legitimate. The only way I can see that it could do that is if an additional field where added to the SV to record the encoding contained by the SV--a not inconsiderable expense. And that's completely ignoring the fact that SVs can and often do contain arbitrary binary data that can legitimately contain nulls and for which there would be no generic mechanism for flagging as exempt from embedded null byte detection. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]


Your skill will accomplish what the force of many cannot
	PerlMonks