Around 5.6.1, length() started returning the length of a string in characters instead of the length in bytes. This means that length() called on a multi-byte UTF-8 string returns a smaller number under 5.6.1 and later than it would under previous versions of Perl.

Fortunately, along with Unicode support came the nifty bytes pragma, which can be used to force length() to return the length of a scalar in bytes like it used to. Unfortunately, pre-5.6 versions of Perl don't have bytes.pm, so this routine was born. The trick to enable you to use bytes regardless of whether or not it's present was the work of Liz:

BEGIN {
    # this hack allows us to "use bytes" or fake it for older (pre-5.6.1)
    # versions of Perl (thanks to Liz from PerlMonks):
    eval { require bytes };
    if ($@) {
        # couldn't find it, but pretend we did anyway:
        $INC{'bytes.pm'} = 1;
        # 5.005_03 doesn't inherit UNIVERSAL::unimport:
        eval "sub bytes::unimport { return 1 }";
    }
}
...
sub size_in_bytes ($) { use bytes; return length shift; }
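
To see the difference between the two counts, the pieces above can be put together in one script. This is only a minimal sketch: the test string and the printf line are illustrative additions, and the "\x{263a}" escape itself assumes a Unicode-aware Perl (5.6 or later), so on older versions you would substitute a plain byte string.

```perl
use strict;

BEGIN {
    # Load bytes.pm if it exists; otherwise fake it so "use bytes" compiles.
    eval { require bytes };
    if ($@) {
        $INC{'bytes.pm'} = 1;
        eval "sub bytes::unimport { return 1 }";
    }
}

sub size_in_bytes ($) {
    use bytes;              # lexically scoped: only affects this sub
    return length shift;
}

# Illustrative data: five smiley faces, three bytes each in UTF-8.
my $string = "\x{263a}" x 5;

# length() counts characters here, size_in_bytes() counts bytes.
printf "characters: %d, bytes: %d\n", length($string), size_in_bytes($string);
```

On a recent Perl this should report 5 characters and 15 bytes. Because "use bytes" is lexically scoped, length() elsewhere in the program keeps counting characters.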

Re: Portable length() in bytes.
by ysth (Canon) on Nov 07, 2004 at 20:20 UTC
    I'm really baffled by this; why would you want length in bytes?

      Well, think about it for a moment. What if your scalar contains arbitrary binary data like a JPEG or *.tar.gz file? You don't want the length in "characters" for it because random byte sequences could get mistaken for multi-byte UTF-8 characters, resulting in a shorter length() than you expected.

      A better example is UTF-8 itself. Tell me, how would you send a UTF-8 string with multi-byte UTF-8 characters in it over a network? In addition, how could you do it portably, so your code would work back to version 5.005 of Perl?

      # five UTF smiley faces (three bytes long each):
      my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";
      my $bytes_written = syswrite($socket, $string, length $string);

      Oops. That ends up writing only five bytes to the socket instead of fifteen, because length() returns the length in characters, not bytes, and each of those smiley faces takes up three bytes.

      Use size_in_bytes() instead and it works regardless of what Perl you're using:

      # five UTF smiley faces (three bytes long each):
      my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";
      my $bytes_written = syswrite($socket, $string, size_in_bytes($string));
        In the cases you mention, you would not want to have data that perl has marked as utf8.
      Regardless of the underlying encoding, the computer still deals with these things as bytes. Storage doesn't care whether the stuff you're storing is UTF-8 or ASCII. Nor does transmission over the network. Bytes are still a useful measure of quantity in some domains.

      thor


        But if you have to know how much you are storing or transmitting, you need to know what your output filehandle is going to do with the data. If the output filehandle will be upgrading to utf8, and your data is "\xff123" (4 bytes, 4 characters, in 8-bit encoding), 5 bytes will be written. If the output filehandle downgrades utf8 and you have "\x{ff}123" (5 bytes, 4 characters, in utf8 encoding), you will be writing just 4 bytes. But how long it is in the encoding perl happens to have it stored as is not relevant.
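
As a rough illustration of that point, the byte count can be computed for each output encoding explicitly rather than read off perl's internal representation. This is a sketch that assumes Perl 5.8 or later with the core Encode module, so it is not portable back to 5.005; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00FF followed by "123": four characters however perl stores it.
my $string = "\x{ff}123";
my $chars  = length $string;

# Octets a UTF-8 output would carry versus a single-byte (Latin-1) output.
my $as_utf8   = encode('UTF-8',      $string);
my $as_latin1 = encode('ISO-8859-1', $string);

printf "characters:    %d\n", $chars;
printf "UTF-8 bytes:   %d\n", length $as_utf8;
printf "Latin-1 bytes: %d\n", length $as_latin1;
```

The same four characters come out as 5 bytes in UTF-8 and 4 bytes in Latin-1, so the number that matters is the one for the encoding actually written.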
      That's almost as stupid as asking why would you want a pointer instead of a reference, why would you want an int as opposed to a float, or why you would want a hash instead of two parallel arrays of "keys" and "values".
        Almost but not quite. Thanks for contributing to the conversation.