Re^6: Handling malformed UTF-16 data with PerlIO layer

There is no such thing as "private-use high-surrogates".

Well, I was referring to the following (quoted from p. 548, section 16.6 of the Unicode Standard v5.0, which I linked to in the original post):

Private-Use High-Surrogates. The high-surrogate code points from U+DB80..U+DBFF are private-use high-surrogate code points (a total of 128 code points). Characters represented by means of a surrogate pair, where the high-surrogate code point is a private-use high-surrogate, are private-use characters from the supplementary private use areas. For more information on private-use characters, see Section 16.5, Private-Use Characters.

though I wasn't just referring to those 128 code points, but rather to the wider context of the respective surrogate pairs, and how they would be used in practice.
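To make the arithmetic behind that quote concrete, here is a minimal sketch of how a surrogate pair maps to a supplementary code point. The helper name surrogates_to_code_point is made up for illustration and is not part of Encode; the formula itself is the standard UTF-16 one.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode): combine a UTF-16 surrogate
# pair into the supplementary code point it represents.
sub surrogates_to_code_point {
    my ($hi, $lo) = @_;
    die sprintf("not a high surrogate: U+%04X\n", $hi)
        unless $hi >= 0xD800 && $hi <= 0xDBFF;
    die sprintf("not a low surrogate: U+%04X\n", $lo)
        unless $lo >= 0xDC00 && $lo <= 0xDFFF;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

# First and last pair of the ordinary high-surrogate range (D800..DB7F)
# and of the private-use high-surrogate range (DB80..DBFF).
for my $pair ([0xD800, 0xDC00], [0xDB7F, 0xDFFF],
              [0xDB80, 0xDC00], [0xDBFF, 0xDFFF]) {
    printf "U+%04X U+%04X -> U+%04X\n",
        @$pair, surrogates_to_code_point(@$pair);
}
```

The last two lines of output show that pairs starting with U+DB80..U+DBFF land in U+F0000..U+10FFFF, i.e. planes 15 and 16, which is where the supplementary private use areas live.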

Anyhow, the fact that you (an expert) deny the existence of private-use high-surrogates kinda confirms what I'm saying :) Encodings like UTF-16 are non-trivial enough that I don't necessarily want to get into every detail of them if there's some way around it (though, as it looks, there doesn't seem to be...). Rather, I'd like to rely on the good work already done within Perl by people like our honorable Juerd. After all, what's the point of having support for Unicode and other encodings in Perl, if you then write your own parsers from scratch?

eval { $u = decode( "UTF-16LE", $c, Encode::FB_WARN ) };

Interestingly, the current Encode docs note:

Handling Malformed Data
The optional CHECK argument tells Encode what to do when it encounters malformed data. (...) NOTE: Not all encodings support this feature. Some encodings ignore CHECK argument. For example, Encode::Unicode ignores CHECK and it always croaks on error.

(Encode::Unicode implements unicode encodings like UTF-16)

AFAICT, this is only partially true. That is, Encode::Unicode appears to honor CHECK when its value is FB_DEFAULT (no croak), but croaks with anything else. That would explain why, with UTF-16, FB_QUIET and FB_WARN do not quite produce the behavior you'd expect from reading the description of those constants...
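For anyone who wants to check this against their own Encode version, here is a minimal sketch that feeds the same malformed UTF-16LE input to decode() with each CHECK constant. The test data and the helper name try_decode are assumptions made up for illustration. What it prints depends on the installed Encode, so no expected output is given.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "A", a lone high surrogate (U+D800), then "B", packed as UTF-16LE.
my $malformed = pack("v*", 0x0041, 0xD800, 0x0042);

# Hypothetical helper: decode the bytes with the given CHECK value and
# report either the resulting code points or the error message.
sub try_decode {
    my ($bytes, $check) = @_;   # $bytes is a copy; decode may modify its input
    my $text = eval { decode("UTF-16LE", $bytes, $check) };
    return "died: $@" unless defined $text;
    return "ok: "
        . join(" ", map { sprintf "U+%04X", ord } split //, $text)
        . "\n";
}

my @checks = (
    [ FB_DEFAULT => Encode::FB_DEFAULT() ],
    [ FB_QUIET   => Encode::FB_QUIET()   ],
    [ FB_WARN    => Encode::FB_WARN()    ],
    [ FB_CROAK   => Encode::FB_CROAK()   ],
);

for my $c (@checks) {
    printf "%-10s %s", $c->[0], try_decode($malformed, $c->[1]);
}
```

If the observation above holds for your version, only the FB_DEFAULT line should start with "ok:", and the other three should report that decode() died.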


Replies are listed 'Best First'.

Re^7: Handling malformed UTF-16 data with PerlIO layer
by graff (Chancellor) on Oct 28, 2008 at 22:24 UTC

Private-Use High-Surrogates. The high-surrogate code points from U+DB80..U+DBFF are private-use high-surrogate code points (a total of 128 code points).

Oh yeah, that's true -- I was just taking the viewpoint that the "surrogate range" as a block (as it relates to potential encoding errors) does not really need to be broken into the parts that map to the "supplementary private-use area", because this area is just part of the "higher planes" in the Unicode space, and is addressed by surrogates in the same way as all the other planes above U+FFFF.

(Encode::Unicode implements unicode encodings like UTF-16)

Thanks for clarifying that -- this thread has been very educational for me.
