
However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this

Do I read "the concern that the utf8 layer will not handle this" correctly as "you are worried, but haven't observed the problem so far"?

I for one would not be concerned unless the problem really occurred, and would trust Perl's IO layer.

In fact I've made a very simple test for this situation:

$ perl -MEncode=encode_utf8 -wE '$| = 1; my $buf = encode_utf8 chr(0xe5); print substr($buf, 0, 1); sleep 1; say substr($buf, 1)' | perl -CS -pe 1
å

This splits the å into two bytes, writes the first, sleeps a second, and then writes the second byte plus a newline. The perl process reading from the pipe decodes the input as UTF-8 (that's what the -CS does), and prints it to STDOUT again. Works fine.
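The same experiment can be written as a single script, which makes it easier to vary the delay or the character. This is only a sketch: it swaps the -CS switch for an explicit :encoding(UTF-8) layer on the reading end of a pipe, and it assumes a platform where fork works. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use IO::Handle;

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: send the two bytes of U+00E5 in separate, unbuffered writes.
    close $reader;
    $writer->autoflush(1);
    my $bytes = encode_utf8(chr(0xe5));
    print {$writer} substr($bytes, 0, 1);
    sleep 1;
    print {$writer} substr($bytes, 1), "\n";
    close $writer;
    exit 0;
}

# Parent: read through a decoding layer, one line at a time.
close $writer;
binmode($reader, ':encoding(UTF-8)') or die "binmode failed: $!";
my $line = <$reader>;
close $reader;
waitpid($pid, 0);
chomp $line;
```

After this runs, $line should hold the single character U+00E5. Note that this exercises buffered reads (readline), just like the one-liner; it says nothing about sysread on a handle with layers.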

$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-0xff][\x80-\xbf]+)*)//;
my $newtext = $1;
utf8::decode($newtext);
$text_so_far .= $newtext;

The regex doesn't look right to me. If a character is encoded as three or more bytes and only its first two bytes have arrived so far, the [\xc0-0xff][\x80-\xbf]+ part still matches those two bytes, so you wouldn't detect that the third is missing.
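If you do want to decode by hand, one alternative to the regex is to let Encode find the boundary. This is a sketch, not a drop-in for your code: decode with a CHECK of FB_QUIET returns whatever it could process and leaves the unprocessed tail in the source buffer, which is what the Encode documentation suggests for data read in fixed-size chunks. The helper name decode_available is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8 FB_QUIET);

# Decode every complete character at the front of the buffer.
# With FB_QUIET, decode() overwrites the buffer with the unprocessed
# remainder, e.g. the first bytes of a character that is still incomplete.
sub decode_available {
    my ($buf_ref) = @_;
    return decode('UTF-8', $$buf_ref, FB_QUIET);
}

my $bytes = encode_utf8("\x{20ac}");    # EURO SIGN, three bytes: E2 82 AC

my $buf         = '';
my $text_so_far = '';

# Simulate the character arriving split after its second byte.
for my $chunk (substr($bytes, 0, 2), substr($bytes, 2)) {
    $buf .= $chunk;
    $text_so_far .= decode_available(\$buf);
}
```

After the first chunk nothing is decoded and both bytes stay in $buf; after the second, $text_so_far contains the complete character and $buf is empty. One caveat: genuinely malformed input also stops the decoder, so real code would need to notice when $buf stops shrinking.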

Perl 6 - the future is here, just unevenly distributed

In reply to Re: incremental reading of utf8 input handles by moritz
in thread incremental reading of utf8 input handles by Tanktalus
