in reply to Re^3: Portable length() in bytes.
in thread Portable length() in bytes.

No, you wouldn't, but what if I have a function like this:

send_soap_message()

and someone calls it like this:

send_soap_message($xml_code_encoded_as_utf8);

and the function calls syswrite() with length() of that string as the LENGTH argument, blindly assuming length() will return the number of bytes.

The keyword here is portability. The idea is a length()-like function that returns the length in bytes regardless of your Perl version, so that you can write code that targets, say, Perl 5.005 but also works with the Unicode-enabled Perls, 5.6.1 and up.
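To make the idea concrete, here is a minimal sketch of such a function. The name bytes_length() is made up for illustration, and the approach (loading bytes.pm at run time only on perls that ship it) is one assumption about how it could be done, not a tested recipe for every release. It reports the size of the string as perl stores it internally, which is not necessarily what an output layer ends up writing.

```perl
use strict;    # no "use warnings" here: warnings.pm does not exist on 5.005

# Hypothetical helper (not from the thread): length of a string in bytes,
# as perl stores it internally.
sub bytes_length {
    my ($string) = @_;
    if ($] >= 5.006) {
        # bytes.pm ships with 5.6.0 and later. Loading it with require keeps
        # older perls from ever compiling a "use bytes" statement.
        require bytes;
        return bytes::length($string);
    }
    # Before 5.6 every character is one byte, so length() is already right.
    return length $string;
}

# The \x{...} escape below needs 5.6 or later; it is only here for the demo.
print bytes_length("abc"), "\n";                        # 3
print bytes_length("\x{263a}\x{263a}\x{263a}"), "\n";   # 9
```

On a Unicode-enabled perl, plain length() returns 3 for the second string, which is the mismatch the question is about.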

Re^5: Portable length() in bytes.
by ysth (Canon) on Nov 07, 2004 at 23:34 UTC
    See Re^3: Portable length() in bytes.. The bottom line is length will not calculate how many bytes will go out, and syswrite is expecting a character count. (Don't get upset if what you are writing is binary data; in that case you should have 8-bit characters.) See some examples (note that v255 is utf-8 encoded in perl, while "\xff" is not, and -CO tells perl that STDOUT expects utf8):
    $ perl -we'$x = v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
    len:2
    wrote: 1
    ÿ
    $ perl -we'$x = "\xff"; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
    len:1
    wrote: 1
    ÿ
    $ perl -CO -we'$x="\xff"; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
    len:1
    wrote: 1
    Ã¿
    $ perl -CO -we'$x=v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
    len:2
    wrote: 1
    Ã¿
    The "length" passed to syswrite is useless; it expects and returns character length and offset. And whether the string being output is 1 byte or 2 bytes, it's just one character, and will be output as either 1 or 2 bytes depending on the output filehandle, not on how perl has it encoded.
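The distinction between the character and its internal storage can be checked without any filehandle at all. The following is only an illustrative sketch, assuming perl 5.8 or later (where utf8::upgrade() is built in); the variable names are made up. It stores the same one-character string both ways and compares what length() and bytes::length() report.

```perl
use strict;
use warnings;
require bytes;    # only for bytes::length(), to peek at the internal storage

my $native   = "\xff";      # stored as the single byte 0xFF
my $upgraded = "\xff";
utf8::upgrade($upgraded);   # same character, now stored as UTF-8 (0xC3 0xBF)

my @report = (
    [ native   => length($native),   bytes::length($native)   ],
    [ upgraded => length($upgraded), bytes::length($upgraded) ],
);
printf "%-8s chars=%d internal_bytes=%d\n", @$_ for @report;

# The two strings still compare equal: the storage differs, the text does not.
print "same string: ", ($native eq $upgraded ? "yes" : "no"), "\n";
```

Both strings are one character long; only the internal byte count differs, and that count says nothing about what a given output layer will write.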

      Correct me if I'm wrong, but what you're saying is, if you syswrite() to a file handle binmode()'d as :utf8 and you write UTF-8 characters using length(), everything works fine, because syswrite() will interpret the length parameter to be the length in UTF-8 characters, not bytes?

      First, what you're talking about only works with Perl 5.8+. Prior versions of Perl do not have the :utf8 binmode. Then you said this, which stumped me:

      The "length" passed to syswrite is useless; it expects and returns character length and offset.

      Well, here's what 5.8's perldoc -f syswrite says:

      syswrite FILEHANDLE,SCALAR,LENGTH,OFFSET
      syswrite FILEHANDLE,SCALAR,LENGTH
      syswrite FILEHANDLE,SCALAR
      
          Attempts to write LENGTH bytes of data from
      variable SCALAR to the specified FILEHANDLE, using the
      system call write(2). If LENGTH is not specified, writes
      whole SCALAR. It bypasses buffered IO, so mixing this with
      reads (other than sysread()), print, write, seek, tell, or
      eof may cause confusion because the perlio and stdio layers
      usually buffers data. Returns the number of bytes
      actually written, or undef if there was an error (in this
      case the errno variable $! is also set). If the LENGTH is
      greater than the available data in the SCALAR after the
      OFFSET, only as much data as is available will be written.
      
          An OFFSET may be specified to write the data from some
      part of the string other than the beginning. A negative
      OFFSET specifies writing that many characters counting
      backwards from the end of the string. In the case the SCALAR
      is empty you can use OFFSET but only zero offset.
      
          Note that if the filehandle has been marked as :utf8,
      Unicode characters are written instead of bytes (the LENGTH,
      OFFSET, and the return value of syswrite() are in UTF-8
      encoded Unicode characters). The :encoding(...) layer
      implicitly introduces the :utf8 layer. See "binmode",
      "open", and the open pragma, open.
      

      Which means under 5.8 you can get away with slipping syswrite() UTF-8 strings (and you can also drop the LENGTH parameter altogether, as it's been optional since 5.6.1), but that still doesn't address the issue of portability.

      Can you guarantee me that this bit of code:

      my $bytes_written = syswrite($self->socket, $data, length $data);

      will work with any version of perl going back to 5.005? (Note the word "Portable" in the node title.)

      Here's an example that seems to break under 5.6.1, unless I'm missing something:

      my $string = "\x{263a}\x{263a}\x{263a}";
      {
          use bytes;
          syswrite(STDOUT, $string, length $string);
      }
      syswrite(STDOUT, "\n", 1);
      syswrite(STDOUT, $string, length $string);
      _˙¦_˙¦_˙¦
      _˙¦
      

      It seems like the third syswrite() is getting back 3 from length(), meaning three characters, which syswrite() interprets to be 3 bytes, so only the first smiley face gets written. I binmode()'d STDOUT and still got the same thing.
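The arithmetic behind that can be sketched as follows: U+263A takes three bytes in UTF-8, so three smileys are 9 bytes, and a LENGTH of 3 read as bytes covers exactly one of them. This is only an illustration, assuming perl 5.8 or later (where utf8::encode() is built in), so it does not answer the 5.005 question; utf8_bytes() is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $string as UTF-8 encoded bytes.
sub utf8_bytes {
    my ($string) = @_;
    my $copy = $string;
    utf8::encode($copy);    # in place: characters become their UTF-8 bytes
    return $copy;
}

my $string = "\x{263a}\x{263a}\x{263a}";
my $bytes  = utf8_bytes($string);

$| = 1;    # flush print() at once so it does not interleave oddly with syswrite()
print "characters: ", length($string), "\n";    # 3
print "bytes:      ", length($bytes),  "\n";    # 9

# $bytes contains only characters below 256, so length() and the LENGTH
# argument of syswrite() now count the same thing.
binmode STDOUT;
my $written = syswrite(STDOUT, $bytes, length $bytes);
syswrite(STDOUT, "\n", 1);
```

Once the string has been encoded explicitly, length() of the result is a byte count and there is nothing left for syswrite() to misread.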

        Use utf8 data on 5.6.x (and even 5.8.0) at your peril. In the earlier 5.8.x versions there were steadily decreasing numbers of utf8-related bugs, but 5.6.x's problem is not only bugs but a bad paradigm. If you must use modules which return utf8 data on 5.6.1, I can only suggest that you decontaminate it at once. perlunicode on 5.6.x says
        WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.
        and really means it.