
Lookahead alone won't do, because a surrogate pair might be split across two reads. Handling that does make things more complicated.

I don't know anything about surrogates. I assumed the following:

hi followed by lo = ok
hi not followed by lo = bad
lo not preceded by hi = bad
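
To make those three rules concrete, here is a minimal sketch that labels a list of 16-bit code units. The name classify_units is made up for illustration, and it works on already-decoded integers rather than raw bytes, so it is not how the script below does it. It also treats a hi at the very end of the list as bad, which is only right once you know no more data is coming.

```perl
use strict;
use warnings;

# Hypothetical helper: label each 16-bit code unit 'ok' or 'bad'
# according to the three rules listed above.
sub classify_units {
    my @units = @_;
    my @labels;
    my $i = 0;
    while ($i < @units) {
        my $u     = $units[$i];
        my $is_hi = $u >= 0xD800 && $u <= 0xDBFF;
        my $is_lo = $u >= 0xDC00 && $u <= 0xDFFF;
        if ($is_hi) {
            my $next = $units[$i + 1];   # undef when hi is the last unit
            if (defined($next) && $next >= 0xDC00 && $next <= 0xDFFF) {
                push @labels, 'ok', 'ok';   # hi followed by lo
                $i += 2;
                next;
            }
            push @labels, 'bad';            # hi not followed by lo
        }
        elsif ($is_lo) {
            push @labels, 'bad';            # lo not preceded by hi
        }
        else {
            push @labels, 'ok';             # ordinary code unit
        }
        $i++;
    }
    return @labels;
}

print join(' ', classify_units(0x0061, 0xDBF4, 0xDEE2, 0x0062)), "\n";   # ok ok ok ok
print join(' ', classify_units(0x0063, 0xDBF4, 0x0063)), "\n";           # ok bad ok
print join(' ', classify_units(0x0065, 0xDEE2, 0x0066)), "\n";           # ok bad ok
```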

#!/usr/bin/perl

# usage:
#    fix_surrogates.pl < infile > outfile

# Input is assumed to be UTF-16LE (the high byte of each code unit comes second).
# Hi Surrogate: D800-DBFF
# Lo Surrogate: DC00-DFFF

use strict;
use warnings;

binmode STDIN;   # Disable :crlf
binmode STDOUT;  # Disable :crlf

my $read_size = 16*1024;

my $valid_pat   = qr/ .[^\xD8-\xDF]
                    | .[\xD8-\xDB].[\xDC-\xDF]
                    /xs;

my $invalid_pat = qr/ .[\xDC-\xDF]
                    | .[\xD8-\xDB](?=.[^\xDC-\xDF])
                    /xs;

my $ibuf = '';
my $obuf = '';

for (;;) {
   my $rv = read(STDIN, $ibuf, $read_size, length($ibuf));
   die("$!\n") if !defined($rv);
   last if !$rv;

   for ($ibuf) {
      /\G ($valid_pat+) /xgc && do { $obuf .= $1;              };
      /\G $invalid_pat  /xgc && do { $obuf .= "\xFD\xFF"; redo };
   }

   print($obuf);

   $ibuf = substr($ibuf, pos($ibuf)||0);
   $obuf = '';
}

# Whatever is left at EOF is an incomplete code unit or an unpaired hi surrogate.
$ibuf =~ s/..?/\xFD\xFF/sg;
print($ibuf);
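
If you want to convince yourself that the carry-over buffer handles a pair cut in two by a read, something like the following should do. It is only a sketch: fix_chunks is a hypothetical helper that runs the same loop over in-memory chunks instead of STDIN, with each chunk standing in for one read.

```perl
use strict;
use warnings;

# Same patterns as the script: UTF-16LE, so the second byte of each unit decides.
my $valid_pat   = qr/ .[^\xD8-\xDF] | .[\xD8-\xDB].[\xDC-\xDF] /xs;
my $invalid_pat = qr/ .[\xDC-\xDF]  | .[\xD8-\xDB](?=.[^\xDC-\xDF]) /xs;

# Hypothetical helper: each argument plays the role of one read() result.
sub fix_chunks {
    my @chunks = @_;
    my ($ibuf, $out) = ('', '');
    for my $chunk (@chunks) {
        $ibuf .= $chunk;
        for ($ibuf) {
            /\G ($valid_pat+) /xgc && do { $out .= $1;              };
            /\G $invalid_pat  /xgc && do { $out .= "\xFD\xFF"; redo };
        }
        # Keep the undecided tail (at most 3 bytes) for the next chunk.
        $ibuf = substr($ibuf, pos($ibuf) || 0);
    }
    $ibuf =~ s/..?/\xFD\xFF/sg;   # leftovers at EOF are malformed
    return $out . $ibuf;
}

# The pair F4 DB E2 DE is split between two "reads" and still comes out intact.
print unpack('H*', fix_chunks("a\0\xF4\xDB", "\xE2\xDEb\0")), "\n";   # 6100f4dbe2de6200
```

The hi surrogate at the end of the first chunk matches neither pattern (the lookahead has nothing to look at yet), so it stays in the buffer until the next chunk shows what follows it.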

Update: Tested. Fixed character class that wasn't negated as it should have been.

>type testdata.pl
binmode STDOUT;
my $hi = "\xF4\xDB";
my $lo = "\xE2\xDE";
print "a\0" . $hi . $lo   . "b\0" . "\n\0",
      "c\0" . $hi . "c\0" . "d\0" . "\n\0",
      "e\0" . $lo . "f\0" . "g\0" . "\n\0";

>perl testdata.pl | perl fix_surrogates.pl | perl -0777 -pe"BEGIN { binmode STDIN, ':encoding(UTF-16le)'; binmode STDOUT, ':encoding(US-ASCII)' }"
"\x{10d2e2}" does not map to ascii, <> chunk 1.
"\x{fffd}" does not map to ascii, <> chunk 1.
"\x{fffd}" does not map to ascii, <> chunk 1.
a\x{10d2e2}b
c\x{fffd}cd
e\x{fffd}fg
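
For anyone wondering where \x{10d2e2} comes from: it is what the test pair DBF4 DEE2 decodes to under the standard UTF-16 formula. A quick sketch (combine_surrogates is just an illustrative name):

```perl
use strict;
use warnings;

# Combine a hi and a lo surrogate into one code point (standard UTF-16 formula).
sub combine_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

printf "U+%X\n", combine_surrogates(0xDBF4, 0xDEE2);   # prints U+10D2E2
```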

In reply to Re^3: Handling malformed UTF-16 data with PerlIO layer by ikegami
in thread Handling malformed UTF-16 data with PerlIO layer by almut
