in reply to Re^4: Portable length() in bytes.
in thread Portable length() in bytes.

See Re^3: Portable length() in bytes.. The bottom line is that length will not tell you how many bytes will go out, and syswrite expects a character count. (Don't be alarmed if what you are writing is binary data; in that case you should have 8-bit characters.) Some examples follow (note that v255 is UTF-8 encoded internally by perl, while "\xff" is not, and -CO tells perl that STDOUT expects UTF-8):
$ perl -we'$x = v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
len:2
wrote: 1
˙
$ perl -we'$x = "\xff"; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
len:1
wrote: 1
˙
$ perl -CO -we'$x="\xff"; {use bytes;print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
len:1
wrote: 1
Aż
$ perl -CO -we'$x=v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
len:2
wrote: 1
Aż
The "length" passed to syswrite is useless; it expects and returns character length and offset. And whether the string being output is 1 byte or 2 bytes, it's just one character, and will be output as either 1 or 2 bytes depending on the output filehandle, not on how perl has it encoded.
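One way to sidestep the ambiguity, sketched here as an assumption rather than anything proposed in this thread, is to encode the string to UTF-8 octets yourself before calling syswrite. Every character in the encoded string is then one byte, so the character count and the byte count agree. The helper name write_octets is hypothetical, and the sketch assumes Perl 5.8+ (for Encode) and a handle with no :utf8 layer.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: turn the text into UTF-8 octets first, so that
# length() of what we pass to syswrite is the number of bytes on the wire.
sub write_octets {
    my ($fh, $text) = @_;
    my $octets = encode('UTF-8', $text);   # one character per byte from here on
    return syswrite($fh, $octets, length $octets);
}

# Write three U+263A smileys (3 bytes each in UTF-8) to a scratch file.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;                               # raw handle, no encoding layer
my $written = write_octets($fh, "\x{263a}" x 3);
close $fh;

my $size = -s $path;
print "syswrite returned $written, file holds $size bytes\n";
```

Both numbers should come out the same, because the encoding step happens in the string and not in the filehandle.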

Re^6: Portable length() in bytes.
by William G. Davis (Friar) on Nov 08, 2004 at 00:37 UTC

    Correct me if I'm wrong, but what you're saying is that if you syswrite() to a file handle binmode()'d as :utf8 and you write UTF-8 characters using length(), everything works fine, because syswrite() will interpret the length parameter as the length in UTF-8 characters, not bytes?

    First, what you're talking about only works with Perl 5.8+. Prior versions of Perl do not have the :utf8 binmode. Then you said this, which stumped me:

    The "length" passed to syswrite is useless; it expects and returns character length and offset.

    Well, here's what 5.8's perldoc -f syswrite says:

    syswrite FILEHANDLE,SCALAR,LENGTH,OFFSET
    syswrite FILEHANDLE,SCALAR,LENGTH
    syswrite FILEHANDLE,SCALAR
    
        Attempts to write LENGTH bytes of data from
    variable SCALAR to the specified FILEHANDLE, using the
    system call write(2). If LENGTH is not specified, writes
    whole SCALAR. It bypasses buffered IO, so mixing this with
    reads (other than sysread()), print, write, seek, tell, or
    eof may cause confusion because the perlio and stdio layers
    usually buffers data. Returns the number of bytes
    actually written, or undef if there was an error (in this
    case the errno variable $! is also set). If the LENGTH is
    greater than the available data in the SCALAR after the
    OFFSET, only as much data as is available will be written.
    
        An OFFSET may be specified to write the data from some
    part of the string other than the beginning. A negative
    OFFSET specifies writing that many characters counting
    backwards from the end of the string. In the case the SCALAR
    is empty you can use OFFSET but only zero offset.
    
        Note that if the filehandle has been marked as :utf8,
    Unicode characters are written instead of bytes (the LENGTH,
    OFFSET, and the return value of syswrite() are in UTF-8
    encoded Unicode characters). The :encoding(...) layer
    implicitly introduces the :utf8 layer. See "binmode",
    "open", and the open pragma, open.
    

    Which means under 5.8 you can get away with slipping syswrite() UTF-8 strings (and you can also drop the LENGTH parameter altogether, as it's been optional since 5.6.1), but that still doesn't address the issue of portability.

    Can you guarantee me that this bit of code:

    my $bytes_written = syswrite($self->socket, $data, length $data);

    will work with any version of perl going back to 5.005? (Note the word "Portable" in the node title.)

    Here's an example that seems to break under 5.6.1, unless I'm missing something:

    my $string = "\x{263a}\x{263a}\x{263a}";
    {
        use bytes;
        syswrite(STDOUT, $string, length $string);
    }
    syswrite(STDOUT, "\n", 1);
    syswrite(STDOUT, $string, length $string);

    _˙¦_˙¦_˙¦
    _˙¦
    

    It seems like the third syswrite() is getting back 3 from length(), meaning three characters, which syswrite() interprets to be 3 bytes, so only the first smiley face gets written. I binmode()'d STDOUT and still got the same thing.
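
    The arithmetic behind that truncation can be checked without any I/O. This is only an illustrative sketch, assuming Perl 5.8+ for utf8::encode; the variable names are made up for the example.

    ```perl
    use strict;
    use warnings;

    my $string = "\x{263a}\x{263a}\x{263a}";   # three characters
    my $octets = $string;
    utf8::encode($octets);                      # in place: now the UTF-8 bytes

    my $char_count = length $string;            # what length() reports normally
    my $byte_count = length $octets;            # what actually has to go out

    # Treating the character count as a byte count keeps only the first
    # $char_count bytes of the encoded data.
    my $truncated = substr $octets, 0, $char_count;

    my $one_smiley = "\x{263a}";
    utf8::encode($one_smiley);

    print "characters: $char_count, bytes: $byte_count\n";
    print "the truncated data is exactly one smiley\n"
        if $truncated eq $one_smiley;
    ```

    Each U+263A takes three bytes in UTF-8, so a length of 3 read as bytes covers just the first character.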

      Use utf8 data on 5.6.x (and even 5.8.0) at your peril. In the earlier 5.8.x versions there were steadily decreasing numbers of utf8-related bugs, but 5.6.x's problem is not only bugs but a bad paradigm. If you must use modules which return utf8 data on 5.6.1, I can only suggest that you decontaminate it at once. perlunicode on 5.6.x says
      WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.
      and really means it.

        WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.

        It may be experimental, but it's still quite usable, and for better or worse, people have been using it for years now.

        Look, it would be nice if everyone used 5.8 and I could just ignore 5.6.1, but sadly, I can't, and neither should other module writers. This length() replacement is the only portable way I've found to write byte-oriented network code that works with versions of Perl going back to 5.005. (POE uses a similar scheme to accomplish the same thing, and you can see the results here.)
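
        For readers who have not seen it, a replacement along these lines might look like the sketch below. It is a hypothetical reconstruction, not the code from the original node or from POE: the name byte_length is invented, and the idea is simply to enable the bytes pragma at compile time when it exists and fall back to plain length() when it does not.

        ```perl
        use strict;

        # Hypothetical helper: count bytes rather than characters, whether or
        # not this perl ships bytes.pm.
        sub byte_length {
            # Runs at compile time. If the pragma loads (5.6+), byte semantics
            # are switched on for this block only. On older perls the require
            # fails quietly; strings there have no wide characters, so plain
            # length() already counts bytes.
            BEGIN { eval { require bytes } and bytes->import }
            return length $_[0];
        }

        # The \x{...} escape itself needs 5.6+, so this demo line is not
        # meant for 5.005.
        my $data = "\x{263a}" x 3;
        print "characters: ", length($data), "\n";
        print "bytes: ", byte_length($data), "\n";
        ```

        The value from byte_length($data) is what you would then pass as LENGTH to syswrite() on a handle without a :utf8 layer.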