in reply to Re^3: Handling malformed UTF-16 data with PerlIO layer
in thread Handling malformed UTF-16 data with PerlIO layer
Thank you very much, again, for actually working out the details. I think I'll go with that approach — unless someone has a better suggestion...
That said, my gut feelings of unease still hold about reimplementing a parser for an encoding I possibly have not fully understood (e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).
Re^5: Handling malformed UTF-16 data with PerlIO layer
by graff (Chancellor) on Oct 28, 2008 at 06:33 UTC
There is no such thing as "private-use high-surrogates". There is a region of the Unicode space reserved for "private use" (from E000 thru F8FF), and there is the region set aside for "surrogates" (from D800 thru DFFF). There's also a "supplementary private use" area running from F0000 to 10FFFF, which is not relevant here (note the extra digits).
There is no "supplemental surrogates" area -- the surrogate region is "special" and unique, reserved specifically so that UTF-16 encodings have a way of representing code points above FFFF (in much the same way that byte-oriented utf8 uses multi-byte sequences for code points above 7F). In effect, UTF-16 is a "variable-width" encoding in the case where code points above FFFF are being used -- such "higher-plane" code points must be expressed via two UTF-16 values.
Since the very highest Unicode code point is 10FFFF (21 bits), and since the high 5 bits are only used for 16 distinct "upper planes" (01....-10...., hence 4 bits worth), the surrogate region provides for the 20 "significant" bits (the code point minus 10000) to be split over two 16-bit words, where the high 6 bits of each word are rigidly fixed: the first word of a surrogate pair must start with 110110 (D800-DBFF, carrying the "High" 10 bits), and the second word must start with 110111 (DC00-DFFF, carrying the "Low" 10 bits).
This serves to explain why you cannot convert a 16-bit value in the surrogate range into a utf8 character -- no characters (no code points) can be defined within that range of 16-bit values. But when a code point above FFFF is correctly encoded into UTF-16, you get surrogates (a pair of 16-bit values, one each in the "High" and "Low" regions of the surrogate range).
Regarding ikegami's observation about FFFE and FFFF, I noticed that this is a difference between 5.8.8 and 5.10.0 -- Encode handles these code points in 5.8 but it spits out the error in 5.10. It's certainly true that Unicode explicitly reserves these values as "non-characters."
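To make the bit-splitting described above concrete, here is a minimal sketch of the arithmetic. It is written in Python purely for illustration (the shifts and masks carry over to Perl unchanged), and the function name `to_surrogate_pair` is made up for this example; it is not part of Encode or any other library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above FFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 10000..10FFFF use a surrogate pair")
    offset = code_point - 0x10000      # leaves the 20 significant bits
    high = 0xD800 | (offset >> 10)     # 110110 followed by the top 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # 110111 followed by the bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)  # U+1D11E MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")          # prints "D834 DD1E"

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = chr(0x1D11E).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The lowest and highest supplementary code points, 10000 and 10FFFF, come out as D800 DC00 and DBFF DFFF, which is why the two halves of the surrogate range are exactly 1024 values each.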
I'm not sure whether 5.8 or 5.10 has the better approach, and I sort of expect that it might depend on the circumstances. I looked for something about this in perldelta, but didn't see anything explicit.
In addition to those two "non-character" code points, the same result applies to the range FDD0 - FDEF. According to the unicode reference page, "These codes are intended for process-internal uses, but are not permitted for interchange." I don't really know what [...]
In any case, here's a test script for identifying all the unsavory (error-inducing) 16-bit values -- you can run this in both 5.8.8 and 5.10.0 to see how the two versions differ in their behavior.
I think the "eval" technique here might be a decent approach for what you need to do with your data -- I'm afraid you'll need to ditch the idea of using the PerlIO::encoding layer, and should probably go with reading into a fixed-size buffer. Check out the description of FB_WARN in the Encode man page, because it handles the case where you are doing fixed-size buffer reads and get a partial character at the end of a given buffer.
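The awkward part of fixed-size buffer reads is a 16-bit unit or a surrogate pair that straddles two buffers. The following is only a rough sketch of one way to carry the leftover bytes forward by hand. It is Python rather than Perl, it does not use Encode or any of its FB_* constants, and the name `find_lone_surrogates` and its parameters are invented for this example.

```python
import io


def find_lone_surrogates(stream, chunk_size=4096, byteorder="big"):
    """Return byte offsets of unpaired surrogates in a UTF-16 byte stream."""
    bad_offsets = []
    pending = b""   # bytes held back because their partner was not read yet
    base = 0        # stream offset of the first byte in `pending`
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        data = pending + chunk
        i = 0
        while i + 2 <= len(data):
            unit = int.from_bytes(data[i:i + 2], byteorder)
            if 0xD800 <= unit <= 0xDBFF:            # high surrogate
                have_next = i + 4 <= len(data)
                if not have_next and not at_eof:
                    break                            # wait for the next buffer
                nxt = int.from_bytes(data[i + 2:i + 4], byteorder) if have_next else None
                if nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                    i += 4                           # well-formed pair
                    continue
                bad_offsets.append(base + i)         # high without a low
            elif 0xDC00 <= unit <= 0xDFFF:
                bad_offsets.append(base + i)         # low without a high
            i += 2
        pending = data[i:]                           # odd byte or dangling high
        base += i
        if at_eof:
            return bad_offsets


# 'A', U+1D11E (valid pair), a lone low, 'B', and a lone high at the very end.
sample = b"\x00\x41\xd8\x34\xdd\x1e\xdc\x00\x00\x42\xd8\x00"
print(find_lone_surrogates(io.BytesIO(sample), chunk_size=3))  # prints [6, 10]
```

A deliberately tiny `chunk_size` such as 3 forces both kinds of split (mid-unit and mid-pair), which makes it a handy way to exercise the carry-over logic. An odd trailing byte at end of input is silently ignored in this sketch; real code would probably want to report it.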
by almut (Canon) on Oct 28, 2008 at 20:12 UTC
> There is no such thing as "private-use high-surrogates".
Well, I was referring to (quote from p. 548, section 16.6, Unicode Standard v5.0, which I linked to in the original post):
though I wasn't just referring to those 128 code points, but rather to the wider context of the respective surrogate pairs, and how they would be used in practice. Anyhow, things like you (an expert) denying the existence of private-use high-surrogates, kinda confirms what I'm saying :) Encodings like UTF-16 are non-trivial enough for me to not necessarily want to get into every detail of it if there's some way around (though, as it looks, there doesn't seem to be...). Rather, I'd like to rely on the good work already done within Perl by people like our honorable Juerd. After all, what's the point of having support for unicode and other encodings in Perl, if you then write your own parsers from scratch?
Interestingly, the current Encode docs note, under Handling Malformed Data:
(Encode::Unicode implements unicode encodings like UTF-16)
AFAICT, this is partially true. That is, the CHECK argument appears to honor the value FB_DEFAULT, but croaks with anything else, which would explain why, with UTF-16, FB_QUIET and FB_WARN do not quite produce the behavior you'd expect from reading the description of those constants...
by graff (Chancellor) on Oct 28, 2008 at 22:24 UTC
Oh yeah, that's true -- I was just taking the viewpoint that the "surrogate range" as a block (as it relates to potential encoding errors) does not really need to be broken into the parts that map to the "supplemental private-use area", because this area is just part of the "higher planes" in the unicode space, and is addressed by surrogates in the same way as all the other planes above FFFF.
> (Encode::Unicode implements unicode encodings like UTF-16)
Thanks for clarifying that -- this thread has been very educational for me.
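For anyone who wants to see that mapping spelled out, here is a small sketch of the reverse arithmetic (surrogate pair back to code point). It is Python, used only to show the numbers, and both function names are made up for this example. The range DB80..DBFF used below is the 128-value "High Private Use Surrogates" block that the Unicode Standard defines.

```python
def from_surrogate_pair(high, low):
    """Combine a (high, low) UTF-16 surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_private_use_high(high):
    """True for the 128 high surrogates that lead into planes 15 and 16."""
    return 0xDB80 <= high <= 0xDBFF


first = from_surrogate_pair(0xDB80, 0xDC00)
last = from_surrogate_pair(0xDBFF, 0xDFFF)
print(f"{first:X}..{last:X}")  # prints "F0000..10FFFF"
```

So a pair whose first word falls in DB80..DBFF always lands in F0000..10FFFF, and the decoding arithmetic is identical to that for every other plane above FFFF; nothing about those pairs needs special handling at the UTF-16 level.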
by ikegami (Patriarch) on Oct 28, 2008 at 10:13 UTC
The error messages I got were from 5.8.8. I don't see any difference between 5.8 and 5.10.
Re^5: Handling malformed UTF-16 data with PerlIO layer
by ikegami (Patriarch) on Oct 28, 2008 at 02:34 UTC
U+FFFE and U+FFFF are invalid.
Same for UCS-2. There could be more.
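As a rough checklist of which 16-bit values are suspect, the sketch below sorts every value from 0000 to FFFF by Unicode's own definitions: the surrogate range D800..DFFF, plus the noncharacters FDD0..FDEF, FFFE and FFFF. It is Python, the function name `classify_unit` is invented for this example, and it says nothing about what any particular version of Encode accepts or rejects.

```python
def classify_unit(value):
    """Label one 16-bit value by Unicode's rules (not by Encode's behavior)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if 0xD800 <= value <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= value <= 0xDFFF:
        return "low surrogate"
    if 0xFDD0 <= value <= 0xFDEF or value in (0xFFFE, 0xFFFF):
        return "noncharacter"
    return "ok"


# Tally the whole 16-bit range.
counts = {}
for value in range(0x10000):
    label = classify_unit(value)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

That gives 1024 high surrogates, 1024 low surrogates and 34 noncharacters. A surrogate is only a problem when it is unpaired, so the first two labels mean "needs a partner" rather than "always an error".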