In the cases you mention, you would not want to have data that perl has marked as utf8.

Re^4: Portable length() in bytes.
by William G. Davis (Friar) on Nov 07, 2004 at 23:19 UTC

    No, you wouldn't, but what if I have a function like this:

    send_soap_message()

    and someone calls it like this:

    send_soap_message($xml_code_encoded_as_utf8);

    and the function passes the result of length() straight to syswrite(), blindly assuming it is the right value.

    The keyword here is portability. The idea is a length()-like function that returns the length in bytes regardless of your Perl distro, which enables you to write code that targets, say, Perl 5.005 but also works with Unicode-enabled Perls 5.6.1 and up.
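A minimal sketch of such a function, for illustration only. The name byte_length() is hypothetical, and the sketch assumes that bytes.pm is present on 5.6 and later and that perls older than 5.6 have byte semantics only. The "\x{263a}" literal in the demonstration needs 5.6 or later.

```perl
use strict;

# Hypothetical helper; byte_length() is not a core Perl function.
BEGIN {
    if ($] >= 5.006 && eval { require bytes; 1 }) {
        # bytes.pm is available: bytes::length() counts bytes even
        # when perl has marked the string as UTF-8 internally.
        *byte_length = sub { bytes::length($_[0]) };
    }
    else {
        # Pre-5.6 perls have no character semantics, so plain
        # length() already returns a byte count.
        *byte_length = sub { length $_[0] };
    }
}

my $string = "\x{263a}\x{263a}\x{263a}";   # three characters, nine bytes as UTF-8
print "characters: ", length($string), "\n";
print "bytes: ", byte_length($string), "\n";
```

Loading bytes.pm with require inside an eval, rather than with a plain use bytes, keeps the file compilable on a perl that does not ship the pragma.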

      See Re^3: Portable length() in bytes.. The bottom line is length will not calculate how many bytes will go out, and syswrite is expecting a character count. (Don't get upset if what you are writing is binary data; in that case you should have 8-bit characters.) See some examples (note that v255 is utf-8 encoded in perl, while "\xff" is not, and -CO tells perl that STDOUT expects utf8):
      $ perl -we'$x = v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
      len:2
      wrote: 1
      ÿ

      $ perl -we'$x = "\xff"; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
      len:1
      wrote: 1
      ÿ

      $ perl -CO -we'$x="\xff"; {use bytes;print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
      len:1
      wrote: 1
      Ã¿

      $ perl -CO -we'$x=v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
      len:2
      wrote: 1
      Ã¿
      The "length" passed to syswrite is useless; it expects and returns character length and offset. And whether the string being output is 1 byte or 2 bytes, it's just one character, and will be output as either 1 or 2 bytes depending on the output filehandle, not on how perl has it encoded.

        Correct me if I'm wrong, but what you're saying is that if you syswrite() to a filehandle binmode()'d as :utf8 and pass length() as the LENGTH argument, everything works fine, because syswrite() will interpret LENGTH as a count of UTF-8 characters, not bytes?

        First, what you're talking about only works with Perl 5.8+. Prior versions of Perl do not have the :utf8 binmode. Then you said this, which stumped me:

        The "length" passed to syswrite is useless; it expects and returns character length and offset.

        Well, here's what 5.8's perldoc -f syswrite says:

        syswrite FILEHANDLE,SCALAR,LENGTH,OFFSET
        syswrite FILEHANDLE,SCALAR,LENGTH
        syswrite FILEHANDLE,SCALAR
        
            Attempts to write LENGTH bytes of data from
        variable SCALAR to the specified FILEHANDLE, using the
        system call write(2). If LENGTH is not specified, writes
        whole SCALAR. It bypasses buffered IO, so mixing this with
        reads (other than sysread()), print, write, seek, tell, or
        eof may cause confusion because the perlio and stdio layers
        usually buffers data. Returns the number of bytes
        actually written, or undef if there was an error (in this
        case the errno variable $! is also set). If the LENGTH is
        greater than the available data in the SCALAR after the
        OFFSET, only as much data as is available will be written.
        
            An OFFSET may be specified to write the data from some
        part of the string other than the beginning. A negative
        OFFSET specifies writing that many characters counting
        backwards from the end of the string. In the case the SCALAR
        is empty you can use OFFSET but only zero offset.
        
            Note that if the filehandle has been marked as :utf8,
        Unicode characters are written instead of bytes (the LENGTH,
        OFFSET, and the return value of syswrite() are in UTF-8
        encoded Unicode characters). The :encoding(...) layer
        implicitly introduces the :utf8 layer. See "binmode",
        "open", and the open pragma, open.
        

        Which means under 5.8 you can get away with slipping syswrite() UTF-8 strings (and you can also drop the LENGTH parameter altogether, as it's been optional since 5.6.1), but that still doesn't address the issue of portability.
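For comparison, here is a hedged sketch of one way to avoid the ambiguity on 5.8 and later: convert the string to octets first, so that length() and syswrite() both deal in bytes. It assumes the Encode and File::Temp modules are available, so it does nothing for 5.005 or 5.6; the temporary file merely stands in for a socket.

```perl
use strict;
use Encode qw(encode_utf8);
use File::Temp qw(tempfile);

my $string = "\x{263a}\x{263a}\x{263a}";   # three characters
my $octets = encode_utf8($string);         # the same text as nine plain bytes

# A temporary file stands in for the socket in this illustration.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;                                # raw bytes, no :utf8 layer

# $octets holds bytes only, so length() here is a byte count.
my $written = syswrite($fh, $octets, length $octets);
close $fh;

print "wrote $written bytes, file size is ", -s $path, "\n";
```

Because $octets never carries character semantics, the LENGTH argument and the return value of syswrite() agree with the number of bytes that reach the file.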

        Can you guarantee me that this bit of code:

        my $bytes_written = syswrite($self->socket, $data, length $data);

        will work with any version of perl going back to 5.005? (Note the word "Portable" in the node title.)

        Here's an example that seems to break under 5.6.1, unless I'm missing something:

        my $string = "\x{263a}\x{263a}\x{263a}";
        {
            use bytes;
            syswrite(STDOUT, $string, length $string);
        }
        syswrite(STDOUT, "\n", 1);
        syswrite(STDOUT, $string, length $string);
        _˙¦_˙¦_˙¦
        _˙¦
        

        It seems like the third syswrite() is getting back 3 from length(), meaning three characters, which syswrite() interprets to be 3 bytes, so only the first smiley face gets written. I binmode()'d STDOUT and still got the same thing.
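The arithmetic behind that observation can be checked with a short sketch, assuming a perl of 5.6 or later where the bytes pragma exists. The variable names are illustrative only.

```perl
use strict;

my $string = "\x{263a}\x{263a}\x{263a}";

my $chars = length $string;        # character count
my ($bytes, $first_three);
{
    use bytes;
    $bytes       = length $string;          # byte count of the UTF-8 form
    $first_three = substr($string, 0, 3);   # the first three bytes
}

# U+263A encodes to the three bytes E2 98 BA, so a write of 3 bytes
# covers exactly one smiley.
my $one_smiley = "\xE2\x98\xBA";

print "characters: $chars, bytes: $bytes\n";
print "first three bytes are one smiley: ",
      ($first_three eq $one_smiley ? "yes" : "no"), "\n";
```

Each smiley takes three bytes, so a character count of 3 treated as a byte count stops after the first one.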