
Well, think about it for a moment. What if your scalar contains arbitrary binary data like a JPEG or *.tar.gz file? You don't want the length in "characters" for it because random byte sequences could get mistaken for multi-byte UTF-8 characters, resulting in a shorter length() than you expected.

A better example is UTF-8 itself. Tell me, how would you send a UTF-8 string with multi-byte UTF-8 characters in it over a network? In addition, how could you do it portably, so your code would work back to version 5.005 of Perl?

# five UTF smiley faces (three bytes long each):
my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";
my $bytes_written = syswrite($socket, $string, length $string);

Oops. That ends up writing only five bytes to the socket instead of fifteen, because length() returns the length in characters, not bytes, and each of those smiley faces takes up three bytes.

Use size_in_bytes() instead and it works regardless of what Perl you're using:

# five UTF smiley faces (three bytes long each):
my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";
my $bytes_written = syswrite($socket, $string, size_in_bytes($string));
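
size_in_bytes() itself is not shown in this post. A minimal sketch of what such a function might look like follows. It assumes the bytes pragma can be loaded on Perl 5.6 and later and is simply missing on 5.005. The $HAVE_BYTES flag is a hypothetical name, and this is not necessarily the implementation from the root of the thread:

```perl
use strict;

# Hypothetical flag: true when bytes.pm could be loaded (Perl 5.6 and later).
my $HAVE_BYTES;
BEGIN { $HAVE_BYTES = eval { require bytes; 1 } ? 1 : 0 }

# Return the number of bytes in Perl's internal representation of $string.
sub size_in_bytes {
    my ($string) = @_;
    # Without bytes.pm, a character is always one byte, so length() is enough.
    return $HAVE_BYTES ? bytes::length($string) : length($string);
}

# Five three-byte characters; this demo line needs a Unicode-aware Perl.
print size_in_bytes("\x{263a}" x 5), "\n";
```

Note that bytes::length() reports the size of Perl's internal buffer, which only matches what goes out on the wire if the filehandle writes that buffer unchanged.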

Re^3: Portable length() in bytes.
by ysth (Canon) on Nov 07, 2004 at 21:55 UTC
    In the cases you mention, you would not want to have data that perl has marked as utf8.

      No, you wouldn't, but what if I have a function like this:

      send_soap_message()

      and someone calls it like this:

      send_soap_message($xml_code_encoded_as_utf8);

      and the function calls syswrite() with length(), blindly assuming it will return the right value?

      The keyword here is portability. The idea is a length()-like function that returns the length in bytes regardless of your Perl distro, which enables you to write code that targets, say, Perl 5.005 but also works with Unicode-enabled Perls 5.6.1 and up.

        See Re^3: Portable length() in bytes. The bottom line is that length will not calculate how many bytes will go out, and syswrite expects a character count anyway. (Don't get upset if what you are writing is binary data; in that case you should have 8-bit characters.) Here are some examples (note that v255 is UTF-8 encoded inside perl, while "\xff" is not, and -CO tells perl that STDOUT expects UTF-8):
        $ perl -we'$x = v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
        len:2
        wrote: 1
        ÿ
        $ perl -we'$x = "\xff"; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
        len:1
        wrote: 1
        ÿ
        $ perl -CO -we'$x="\xff"; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
        len:1
        wrote: 1
        Ã¿
        $ perl -CO -we'$x=v255; {use bytes; print STDERR "len:",($len=length $x),"\n" } print STDERR "wrote: ",($x = syswrite STDOUT, $x, $len),"\n"'|cat
        len:2
        wrote: 1
        Ã¿
        The "length" passed to syswrite is useless; it expects and returns character length and offset. And whether the string being output is 1 byte or 2 bytes, it's just one character, and will be output as either 1 or 2 bytes depending on the output filehandle, not on how perl has it encoded.
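
One way to sidestep the ambiguity, sketched here on the assumption that the Encode module is available (Perl 5.8 and later, so this does not help with 5.005): encode the string to octets yourself before writing. Every character of the result is then exactly one byte, so the character count and the byte count agree. The helper name utf8_octets is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into its UTF-8 octets.
sub utf8_octets {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $string = "\x{263a}" x 5;          # five smiley-face characters
my $octets = utf8_octets($string);    # each character is now one byte

printf "%d characters, %d bytes\n", length($string), length($octets);

# A call such as syswrite($socket, $octets, length $octets) would then ask
# for the whole buffer; a short write is still possible on a real socket.
```

The point of the design is that the encoding decision is made once, explicitly, rather than left to the layers on the filehandle.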