Re: Handling malformed UTF-16 data with PerlIO layer

I was hoping that a test like this would point the way to a decent solution, but having tried it on 5.8.8, it doesn't show the results I would want:

use strict;
use Encode;
binmode STDOUT, ":utf8";

my %test_sets = (
    normal =>   [ 0x40 .. 0x7f ],  # normal ascii range of 64 characters
  # put some surrogate data on a record boundary:
    oksplit =>  [ 0x40 .. 0x5e, 0xd801, 0xdc01, 0x61 .. 0x7f ],  # good surrog. pair
    danglehi => [ 0x40 .. 0x5e, 0xd801, 0x60 .. 0x7f ], # bad: missing Lo surrog.
    danglelo => [ 0x40 .. 0x5e, 0xdc01, 0x60 .. 0x7f ], # bad: missing Hi surrog.
    invsplit => [ 0x40 .. 0x5e, 0xdc01, 0xd801, 0x61 .. 0x7f ],  # two surrog. errors
  # same as above, but not on a record boundary:
    okmid =>    [ 0x40 .. 0x4e, 0xd801, 0xdc01, 0x51 .. 0x7f ],  # good surrog. pair
    strandhi => [ 0x40 .. 0x4e, 0xd801, 0x50 .. 0x7f ], # bad: missing Lo surrog.
    strandlo => [ 0x40 .. 0x4e, 0xdc01, 0x50 .. 0x7f ], # bad: missing Hi surrog.
    invmid =>   [ 0x40 .. 0x4e, 0xdc01, 0xd801, 0x51 .. 0x7f ],  # two surrog. errors
    );

for my $type ( qw/normal okmid oksplit
                  strandhi strandlo invmid
                  danglehi danglelo invsplit/ ) {
    warn "\nRunning test on $type;\n";
    print "\nRunning test on $type:\n";
    my $string = pack( 'v*', @{$test_sets{$type}} );
    my $u = '';
    {
        open my $fh, "<", \$string or die $!;
        my $pass = 1;
        $_ = '';
        while ( read( $fh, $_, 64, length())) {
            eval { $u .= decode( "UTF-16LE", $_, Encode::FB_WARN ) };
            if ( $@ ) {
                # the format needs a fourth placeholder for the leftover code units
                warn sprintf( "on pass %d: %s; leaving %d bytes: %s\n",
                              $pass, $@, length(),
                              join( " ", unpack( "v*", $_ )));
            }
            $pass++;
        }
    }
    print "\n$u\n";
}

If you try that out (redirecting stdout to a file), you'll see that valid surrogate pairs are handled nicely, whether they are record-internal or split across consecutive records. But if there is an improper surrogate anywhere in a given string, "decode()" does not return anything at all, and the entire string is left unprocessed.

It looks to me like you'll need to use ikegami's approach of fixing the data to remove bad surrogate values before you try to decode from utf-16 to utf-8. Or at least, you'll need to use an eval block like the one above, and fix the input string whenever $@ indicates a surrogate error.
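
For what it's worth, here is a rough sketch of that kind of pre-cleaning. It is not ikegami's actual code: the sub name scrub_utf16le is made up for illustration, and replacing each unpaired surrogate with U+FFFD is just one possible policy (you could drop them instead). It also assumes the whole UTF-16LE string is in memory; with 64-byte reads like the test above, a high surrogate at the end of a chunk would have to be held back until the next read.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every unpaired surrogate in a UTF-16LE
# byte string with U+FFFD, leaving valid hi/lo pairs untouched.
sub scrub_utf16le {
    my ($bytes) = @_;
    my @units = unpack 'v*', $bytes;    # 16-bit little-endian code units
    my @clean;
    while (@units) {
        my $u = shift @units;
        if ( $u >= 0xD800 && $u <= 0xDBFF ) {    # high surrogate
            if ( @units && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF ) {
                push @clean, $u, shift @units;   # valid pair: keep both
            }
            else {
                push @clean, 0xFFFD;             # missing low surrogate
            }
        }
        elsif ( $u >= 0xDC00 && $u <= 0xDFFF ) {
            push @clean, 0xFFFD;                 # low surrogate with no high
        }
        else {
            push @clean, $u;                     # ordinary code unit
        }
    }
    return pack 'v*', @clean;
}

# One dangling high, one dangling low, then one good pair.
my $bad   = pack 'v*', 0x41, 0xd801, 0x42, 0xdc01, 0xd801, 0xdc01, 0x43;
my $clean = scrub_utf16le($bad);
my $text  = decode( 'UTF-16LE', $clean, Encode::FB_CROAK );

printf "%s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
# prints: U+0041 U+FFFD U+0042 U+FFFD U+10401 U+0043
```

After scrubbing, the decode call should no longer die on surrogate errors, so the eval is only needed for other kinds of failure.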
