I was hoping that a test like this would point the way to a decent solution, but having tried it on Perl 5.8.8, it doesn't show the results I would want:

use strict;
use Encode;
binmode STDOUT, ":utf8";

my %test_sets = (
    normal =>   [ 0x40 .. 0x7f ],  # normal ascii range of 64 characters
  # put some surrogate data on a record boundary:
    oksplit =>  [ 0x40 .. 0x5e, 0xd801, 0xdc01, 0x61 .. 0x7f ],  # good surrog. pair
    danglehi => [ 0x40 .. 0x5e, 0xd801, 0x60 .. 0x7f ], # bad: missing Lo surrog.
    danglelo => [ 0x40 .. 0x5e, 0xdc01, 0x60 .. 0x7f ], # bad: missing Hi surrog.
    invsplit => [ 0x40 .. 0x5e, 0xdc01, 0xd801, 0x61 .. 0x7f ],  # two surrog. errors
  # same as above, but not on a record boundary:
    okmid =>    [ 0x40 .. 0x4e, 0xd801, 0xdc01, 0x51 .. 0x7f ],  # good surrog. pair
    strandhi => [ 0x40 .. 0x4e, 0xd801, 0x50 .. 0x7f ], # bad: missing Lo surrog.
    strandlo => [ 0x40 .. 0x4e, 0xdc01, 0x50 .. 0x7f ], # bad: missing Hi surrog.
    invmid =>   [ 0x40 .. 0x4e, 0xdc01, 0xd801, 0x51 .. 0x7f ],  # two surrog. errors
    );

for my $type ( qw/normal okmid oksplit
                  strandhi strandlo invmid
                  danglehi danglelo invsplit/ ) {
    warn "\nRunning test on $type;\n";
    print "\nRunning test on $type:\n";
    my $string = pack( 'v*', @{$test_sets{$type}} );
    my $u = '';
    {
        open my $fh, "<", \$string or die $!;
        my $pass = 1;
        $_ = '';
        while ( read( $fh, $_, 64, length())) {
            eval { $u .= decode( "UTF-16LE", $_, Encode::FB_WARN ) };
            if ( $@ ) {
                # report the pass, the error, and the code units left undecoded
                warn sprintf( "on pass %d: %s; leaving %d bytes: %s\n",
                              $pass, $@, length(), join( " ", unpack( "v*", $_ )));
            }
            $pass++;
        }
    }
    print "\n$u\n";
}

If you try that out (redirecting stdout to a file), you'll see that valid surrogate pairs are handled nicely, whether they are record-internal or split across consecutive records. But if there is an improper surrogate anywhere in a given string, "decode()" does not return anything at all, and the entire string is left unprocessed.

It looks to me like you'll need to use ikegami's approach of fixing the data to remove bad surrogate values before you try to decode from utf-16 to utf-8. Or at least, you'll need to use an eval block like the one above, and fix the input string whenever $@ indicates a surrogate error.
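
For what it's worth, here is a rough sketch of what "fixing the data first" might look like. This is not ikegami's actual code; fix_surrogates() is a hypothetical helper I made up for illustration, and replacing each unpaired surrogate with U+FFFD is just one assumed policy (you could drop them instead). It works on the list of 16-bit code units, so it assumes you have the whole string in hand, or have already carried any trailing high surrogate over to the next record.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): walk a list of UTF-16 code
# units and replace every unpaired surrogate with U+FFFD.
sub fix_surrogates {
    my @in = @_;
    my @out;
    while ( @in ) {
        my $u = shift @in;
        if ( $u >= 0xD800 && $u <= 0xDBFF ) {            # high surrogate
            if ( @in && $in[0] >= 0xDC00 && $in[0] <= 0xDFFF ) {
                push @out, $u, shift @in;                # valid pair: keep both
            }
            else {
                push @out, 0xFFFD;                       # dangling high surrogate
            }
        }
        elsif ( $u >= 0xDC00 && $u <= 0xDFFF ) {
            push @out, 0xFFFD;                           # stray low surrogate
        }
        else {
            push @out, $u;                               # ordinary code unit
        }
    }
    return @out;
}

# One dangling high, one stray low, and one good pair:
my @units = ( 0x41, 0xD801, 0x42, 0xDC01, 0xD801, 0xDC01, 0x43 );
my $clean = pack( 'v*', fix_surrogates( @units ) );
my $text  = decode( 'UTF-16LE', $clean, Encode::FB_CROAK );
printf "%d characters decoded\n", length $text;
```

With the bad surrogates replaced up front, decode() should have nothing left to complain about, so the eval and the $@ check become a safety net rather than the main line of defense.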

In reply to Re: Handling malformed UTF-16 data with PerlIO layer by graff
in thread Handling malformed UTF-16 data with PerlIO layer by almut
