However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this.
Do I read "concern that the utf8 layer will not handle this" correctly as "you are worried, but haven't observed the problem so far"?
I for one would not be concerned unless the problem really occurred, and would trust perl's IO layer.
In fact I've made a very simple test for this situation:
$ perl -MEncode=encode_utf8 -wE '$| = 1; my $buf = encode_utf8 chr(0xe5); print substr($buf, 0, 1); sleep 1; say substr($buf, 1)' | perl -CS -pe 1
å
This splits the å into two bytes, writes the first, sleeps a second, and then writes the second byte plus a newline. The perl process reading from the pipe decodes the input as UTF-8 (that's what the -CS does), and prints it to STDOUT again. Works fine.
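For a non-blocking reader like the one discussed in this thread, the usual alternative to hand-rolled regexes is Encode's incremental decoding: with the FB_QUIET check mode, decode() consumes only complete UTF-8 sequences and leaves any trailing partial character in the source string for the next round. The helper below (feed() is a hypothetical name, not from the thread) sketches that approach with the same split-å scenario as the one-liner above:

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Bytes that arrived but don't yet form a complete character.
my $pending = '';

# Hypothetical helper: append newly read bytes, return the decoded
# text that is complete so far. With FB_QUIET, decode() stops at the
# first incomplete sequence and leaves the unprocessed bytes in
# $pending (it modifies its source argument in place).
sub feed {
    my ($bytes) = @_;
    $pending .= $bytes;
    return decode('UTF-8', $pending, FB_QUIET);
}

# "å" is 0xC3 0xA5 in UTF-8; split it across two reads.
my $text = '';
$text .= feed("\xc3");     # incomplete: decodes to ''
$text .= feed("\xa5\n");   # completes the character: "å\n"
```

This sidesteps the problem entirely: no byte pattern needs to be classified by hand, and a split character simply waits in $pending until its continuation bytes arrive.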
$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-\xff][\x80-\xbf]+)*)//; my $newtext = $1; utf8::decode($newtext); $text_so_far .= $newtext;
The regex doesn't look right to me. If you have a character that is encoded as three or more bytes, the [\xc0-\xff][\x80-\xbf]+ part could match only the first two bytes, and you wouldn't detect that the third was missing.
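To make that concrete, here is a small demonstration (my own, not from the thread) using "€", which is 0xE2 0x82 0xAC in UTF-8. If a read delivers only the first two bytes, the regex above still consumes them as if the character were complete:

```perl
use strict;
use warnings;

# "€" is 0xE2 0x82 0xAC; simulate a read() that split the character
# after the second byte.
my $buf = "\xe2\x82";   # incomplete 3-byte sequence

$buf =~ s/^((?:[\x00-\x7f]+|[\xc0-\xff][\x80-\xbf]+)*)//;
my $grabbed = $1;

# [\xc0-\xff][\x80-\xbf]+ happily matches E2 82, so $grabbed now
# holds the two-byte fragment and $buf is empty, even though the
# final 0xAC byte has not arrived yet.
```

A correct pattern would have to count continuation bytes per leading byte (one for 0xC2-0xDF, two for 0xE0-0xEF, three for 0xF0-0xF4) instead of using an open-ended +.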
In reply to Re: incremental reading of utf8 input handles by moritz
in thread incremental reading of utf8 input handles by Tanktalus