in reply to Re^5: Portable length() in bytes.
in thread Portable length() in bytes.

Correct me if I'm wrong, but what you're saying is that if you syswrite() to a filehandle binmode()'d as :utf8 and pass it the character count from length(), everything works fine, because syswrite() interprets the LENGTH parameter as a count of UTF-8 characters, not bytes?

First, what you're talking about only works with Perl 5.8+. Prior versions of Perl do not have the :utf8 layer for binmode(). Then you said this, which stumped me:

The "length" passed to syswrite is useless; it expects and returns character length and offset.

Well, here's what 5.8's perldoc -f syswrite says:

syswrite FILEHANDLE,SCALAR,LENGTH,OFFSET
syswrite FILEHANDLE,SCALAR,LENGTH
syswrite FILEHANDLE,SCALAR

    Attempts to write LENGTH bytes of data from
variable SCALAR to the specified FILEHANDLE, using the
system call write(2). If LENGTH is not specified, writes
whole SCALAR. It bypasses buffered IO, so mixing this with
reads (other than sysread()), print, write, seek, tell, or
eof may cause confusion because the perlio and stdio layers
usually buffers data. Returns the number of bytes
actually written, or undef if there was an error (in this
case the errno variable $! is also set). If the LENGTH is
greater than the available data in the SCALAR after the
OFFSET, only as much data as is available will be written.

    An OFFSET may be specified to write the data from some
part of the string other than the beginning. A negative
OFFSET specifies writing that many characters counting
backwards from the end of the string. In the case the SCALAR
is empty you can use OFFSET but only zero offset.

    Note that if the filehandle has been marked as :utf8,
Unicode characters are written instead of bytes (the LENGTH,
OFFSET, and the return value of syswrite() are in UTF-8
encoded Unicode characters). The :encoding(...) layer
implicitly introduces the :utf8 layer. See "binmode",
"open", and the open pragma, open.

Which means under 5.8 you can get away with slipping syswrite() UTF-8 strings (and you can also drop the LENGTH parameter altogether, as it's been optional since 5.6.1), but that still doesn't address the issue of portability.

Can you guarantee me that this bit of code:

my $bytes_written = syswrite($self->socket, $data, length $data);

will work with any version of Perl going back to 5.005? (Note the word "Portable" in the node title.)

Here's an example that seems to break under 5.6.1, unless I'm missing something:

my $string = "\x{263a}\x{263a}\x{263a}";
{
    use bytes;
    syswrite(STDOUT, $string, length $string);
}
syswrite(STDOUT, "\n", 1);
syswrite(STDOUT, $string, length $string);
_ÿ¦_ÿ¦_ÿ¦
_ÿ¦

It seems like the third syswrite() is getting back 3 from length(), meaning three characters, which syswrite() interprets to be 3 bytes, so only the first smiley face gets written. I binmode()'d STDOUT and still got the same thing.
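
To make the mismatch concrete, here is a minimal sketch that reports both counts for the same string. It assumes Perl 5.6 or later (it needs the bytes pragma and the \x{...} escape), and the helper name char_and_byte_length is made up for illustration; it is not part of any module.

```perl
use strict;

# Hypothetical helper (name is illustrative only): return the
# character count and the byte count of the same scalar.
sub char_and_byte_length {
    my ($string) = @_;
    my $n_chars = length $string;       # character semantics
    my $n_bytes = do {
        use bytes;                      # byte semantics, lexically scoped
        length $string;
    };
    return ($n_chars, $n_bytes);
}

my ($n_chars, $n_bytes) = char_and_byte_length("\x{263a}\x{263a}\x{263a}");
print "chars=$n_chars bytes=$n_bytes\n";    # prints "chars=3 bytes=9"
```

Each smiley takes three bytes in UTF-8, so a LENGTH of 3 handed to a byte-oriented syswrite() covers exactly one of them, which matches the output above.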

Re^7: Portable length() in bytes.
by ysth (Canon) on Nov 08, 2004 at 05:58 UTC
    Use utf8 data on 5.6.x (and even 5.8.0) at your peril. In the earlier 5.8.x versions there were steadily decreasing numbers of utf8-related bugs, but 5.6.x's problem is not only bugs but a bad paradigm. If you must use modules which return utf8 data on 5.6.1, I can only suggest that you decontaminate it at once. perlunicode on 5.6.x says
    WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.
    and really means it.

      WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.

      It may be experimental, but it's still quite usable, and for better or worse, people have been using it for years now.

      Look, it would be nice if everyone used 5.8 and I could just ignore 5.6.1, but sadly, I can't, and neither should other module writers. This length() replacement is the only portable way I've found to write byte-oriented network code that works with versions of Perl going back to 5.005. (POE uses a similar scheme to accomplish the same thing, and you can see the results here.)
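
For readers who have not seen the earlier nodes, a length() replacement of this kind might look roughly like the sketch below. This is an assumption about the general shape, not the exact code from this thread or from POE; the name byte_length is hypothetical. The string eval is there so that a perl without bytes.pm (5.005) can still compile the file and fall back to plain length().

```perl
use strict;

BEGIN {
    # Try to compile a version that counts bytes. "use bytes" fails at
    # compile time on perls that lack bytes.pm, so it must live inside
    # a string eval rather than in the file proper.
    eval q{
        use bytes;
        sub byte_length { return length $_[0] }
        1;
    }
    # Fallback for old perls, where length() already counts bytes.
    or eval q{
        sub byte_length { return length $_[0] }
        1;
    };
}

# Usage, mirroring the syswrite() call discussed above
# ($socket and $data are placeholders):
#   my $bytes_written = syswrite($socket, $data, byte_length($data));
```

The point of the design is that callers always pass a byte count as LENGTH, whichever perl is running. On 5.8+ handles that carry a :utf8 layer are a separate matter, since the quoted documentation says LENGTH is counted in characters there.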