in reply to Portable length() in bytes.

I'm really baffled by this; why would you want length in bytes?

Re^2: Portable length() in bytes.
by William G. Davis (Friar) on Nov 07, 2004 at 21:11 UTC

    Well, think about it for a moment. What if your scalar contains arbitrary binary data like a JPEG or *.tar.gz file? You don't want the length in "characters" for it because random byte sequences could get mistaken for multi-byte UTF-8 characters, resulting in a shorter length() than you expected.

    A better example is UTF-8 itself. Tell me, how would you send a UTF-8 string with multi-byte UTF-8 characters in it over a network? In addition, how could you do it portably, so your code would work back to version 5.005 of Perl?

    # five UTF smiley faces (three bytes long each):
    my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";
    my $bytes_written = syswrite($socket, $string, length $string);

    Oops. That ends up writing only five bytes to the socket instead of fifteen, because length() returns the length in characters, not bytes, and each of those smiley faces takes up three bytes.

    Use size_in_bytes() instead and it works regardless of what Perl you're using:

    # five UTF smiley faces (three bytes long each):
    my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";
    my $bytes_written = syswrite($socket, $string, size_in_bytes($string));
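
    The body of size_in_bytes() is not shown in this reply, so the following is only a minimal sketch of how such a function might be written, not necessarily the implementation from the original post. It assumes the bytes pragma is available on Perl 5.6 and later and absent on 5.005, where length() already counts bytes. The name $byte_length is a hypothetical helper introduced for illustration.

```perl
use strict;

# Build the byte-counting routine once. The string eval keeps "use bytes"
# away from the compiler on Perls that do not ship the pragma (assumed:
# 5.005), where the eval fails and we fall back to plain length().
my $byte_length = eval q{
    use bytes;                      # lexically scoped to this eval
    return sub { length $_[0] };    # length() counts bytes here
} || sub { length $_[0] };          # pre-Unicode Perl: length() is already bytes

# Hypothetical stand-in for the size_in_bytes() discussed above.
sub size_in_bytes {
    return $byte_length->($_[0]);
}

my $string = "\x{263a}" x 5;        # five smiley faces
printf "%d characters, %d bytes\n",
    length($string), size_in_bytes($string);   # 5 characters, 15 bytes
```

    Note that this reports the size of perl's internal representation of the string, which is the quantity the examples above rely on.
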
      In the cases you mention, you would not want to have data that perl has marked as utf8.

        No, you wouldn't, but what if I have a function like this:

        send_soap_message()

        and someone calls it like this:

        send_soap_message($xml_code_encoded_as_utf8);

        and the function calls syswrite() with length() as the byte count, blindly assuming it will return the right value?

        The keyword here is portability. The idea is a length()-like function that returns the length in bytes regardless of your Perl version, which enables you to write code that targets, say, Perl 5.005 but also works with Unicode-enabled Perls 5.6.1 and up.

Re^2: Portable length() in bytes.
by thor (Priest) on Nov 07, 2004 at 20:44 UTC
    Regardless of the underlying encoding, the computer still deals with these things as bytes. Storage doesn't care whether the stuff you're storing is UTF-8 or ASCII. Nor does transmission over the network. Bytes are still a useful measure of quantity in some domains.

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come

      But if you have to know how much you are storing or transmitting, you need to know what your output file handle is going to do with the data. If the output file handle will be upgrading to utf8, and your data is "\xff123" (4 bytes, 4 characters, in 8-bit encoding), 5 bytes will be written. If the output filehandle downgrades utf8 and you have "\x{ff}123" (5 bytes, 4 characters, in utf8 encoding), you will be writing just 4 bytes. How long the string is in whatever encoding perl happens to store it in is not relevant.
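
      The same four-character string can be measured in both internal representations. This is a rough sketch that assumes Perl 5.8 or later, where utf8::upgrade(), utf8::downgrade() and bytes::length() are available. The @report array is a name introduced here only for illustration.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling the pragma

my $data = "\xff123";    # one high-bit character followed by "123"
my @report;

# Stored as native 8-bit data: one byte per character.
push @report, [ 'native 8-bit', length($data), bytes::length($data) ];

# Upgraded to perl's internal utf8: \xff now occupies two bytes.
utf8::upgrade($data);
push @report, [ 'upgraded', length($data), bytes::length($data) ];

# Downgraded again: back to one byte per character.
utf8::downgrade($data);
push @report, [ 'downgraded', length($data), bytes::length($data) ];

# The character count stays at 4 throughout; the byte count goes 4, 5, 4.
printf "%-12s chars=%d bytes=%d\n", @$_ for @report;
```
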
Re^2: Portable length() in bytes.
by DrHyde (Prior) on Nov 08, 2004 at 10:13 UTC
    That's almost as stupid as asking why would you want a pointer instead of a reference, why would you want an int as opposed to a float, or why you would want a hash instead of two parallel arrays of "keys" and "values".
      Almost but not quite. Thanks for contributing to the conversation.
        Glad that you agree that it is at least somewhat stupid :-)