Portable length() in bytes.

Around 5.6.1, length() started returning the length of a string in characters instead of the length in bytes. This means that length() called on a multi-byte UTF-8 string returns a smaller number under 5.6.1 and later than it would under previous versions of Perl.

Fortunately, along with Unicode support came the nifty bytes pragma, which can be used to force length() to return the length of a scalar in bytes like it used to. Unfortunately, pre-5.6 versions of Perl don't have bytes.pm, so this routine was born. The trick to enable you to use bytes regardless of whether or not it's present was the work of Liz:

BEGIN {
    # this hack allows us to "use bytes" or fake it for older (pre-5.6.1)
    # versions of Perl (thanks to Liz from PerlMonks):
    eval { require bytes };

    if ($@)
    {
        # couldn't find it, but pretend we did anyway:
        $INC{'bytes.pm'} = 1;

        # 5.005_03 doesn't inherit UNIVERSAL::unimport:
        eval "sub bytes::unimport { return 1 }";
    }
}

...

sub size_in_bytes ($)
{
    use bytes;

    return length shift;
}
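
As a quick check of the character/byte difference described above, here is a minimal sketch that compares the two counts. It is not part of the original routine; it assumes a Perl that already ships bytes.pm, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# Five WHITE SMILING FACE characters (U+263A); each takes three bytes in UTF-8.
my $string = "\x{263a}" x 5;

# Default semantics: length() counts characters.
my $chars = length $string;

# Inside the lexical scope of "use bytes", length() counts the bytes of
# Perl's internal representation instead.
my $bytes = do { use bytes; length $string };

print "characters: $chars\n";   # characters: 5
print "bytes:      $bytes\n";   # bytes:      15
```

Note that the byte count reported this way is the size of whatever internal representation Perl happens to hold for the scalar, which is not necessarily what a filehandle will write.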

Re: Portable length() in bytes. by ysth (Canon) on Nov 07, 2004 at 20:20 UTC
I'm really baffled by this; why would you want length in bytes?
Re^2: Portable length() in bytes. by William G. Davis (Friar) on Nov 07, 2004 at 21:11 UTC
Well, think about it for a moment. What if your scalar contains arbitrary binary data like a JPEG or *.tar.gz file? You don't want the length in "characters" for it because random byte sequences could get mistaken for multi-byte UTF-8 characters, resulting in a shorter length() than you expected.

A better example is UTF-8 itself. Tell me, how would you send a UTF-8 string with multi-byte UTF-8 characters in it over a network? In addition, how could you do it portably, so your code would work back to version 5.005 of Perl?

# five UTF smiley faces (three bytes long each):
my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";

my $bytes_written = syswrite($socket, $string, length $string);

Oops. That ends up writing only five bytes to the socket instead of fifteen, because length() returns the length in characters, not bytes, and each of those smiley faces takes up three bytes. Use size_in_bytes() instead and it works regardless of what Perl you're using:

# five UTF smiley faces (three bytes long each):
my $string = "\x{263a}\x{263a}\x{263a}\x{263a}\x{263a}";

my $bytes_written = syswrite($socket, $string, size_in_bytes($string));
Re^3: Portable length() in bytes. by ysth (Canon) on Nov 07, 2004 at 21:55 UTC
In the cases you mention, you would not want to have data that perl has marked as utf8.
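
One way to read the point above: if the data is converted to plain octets before it is written, ordinary length() already gives the byte count. The sketch below assumes the Encode module, which ships with Perl 5.8 and later, so it does not address the 5.005 portability goal of the original post. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Five characters, flagged internally as utf8 by perl.
my $string = "\x{263a}" x 5;

# encode() returns a string of octets that is not marked as utf8.
my $octets = encode('UTF-8', $string);

print length($string), "\n";    # 5  (characters)
print length($octets), "\n";    # 15 (bytes, one per "character" of $octets)
print utf8::is_utf8($octets) ? "marked utf8\n" : "plain octets\n";   # plain octets
```

With $octets in hand, a call such as syswrite($socket, $octets, length $octets) would pass a length that matches the bytes actually sent.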
Re^4: Portable length() in bytes. by William G. Davis (Friar) on Nov 07, 2004 at 23:19 UTC
Re^5: Portable length() in bytes. by ysth (Canon) on Nov 07, 2004 at 23:34 UTC
Re^2: Portable length() in bytes. by thor (Priest) on Nov 07, 2004 at 20:44 UTC
Regardless of the underlying encoding, the computer still deals with these things as bytes. Storage doesn't care whether the stuff you're storing is UTF-8 or ASCII. Nor does transmission over the network. Bytes are still a useful measure of quantity in some domains.

thor

Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come
Re^3: Portable length() in bytes. by ysth (Canon) on Nov 07, 2004 at 22:00 UTC
But if you have to know how much you are storing or transmitting, you need to know what your output file handle is going to do with the data. If the output file handle will be upgrading to utf8, and your data is "\xff123" (4 bytes, 4 characters, in 8-bit encoding), 5 bytes will be written. If the output filehandle downgrades utf8 and you have "\x{ff}123" (5 bytes, 4 characters, in utf8 encoding), you will be writing just 4 bytes. But how long it is in the encoding perl happens to have it stored as is not relevant.
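
The two cases described above can be reproduced with in-memory filehandles. This is only a sketch under stated assumptions: it needs Perl 5.8 or later for in-memory files and PerlIO layers, it uses the :utf8 and :raw layers as stand-ins for an "upgrading" and a "downgrading" handle, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $native = "\xff123";       # 4 characters, stored as 4 bytes

my $upgraded = $native;
utf8::upgrade($upgraded);     # same 4 characters, stored internally as 5 bytes

# Case 1: a handle that encodes to UTF-8 on output.
my $buf_utf8 = '';
open(my $out_utf8, '>:utf8', \$buf_utf8) or die "open failed: $!";
print {$out_utf8} $native;
close $out_utf8;

# Case 2: a byte-oriented handle; the utf8-stored string is downgraded.
my $buf_raw = '';
open(my $out_raw, '>:raw', \$buf_raw) or die "open failed: $!";
print {$out_raw} $upgraded;
close $out_raw;

print length($buf_utf8), "\n";   # 5 bytes written
print length($buf_raw), "\n";    # 4 bytes written
```

In both cases the number of bytes written depends on the handle's layers, not on how the scalar was stored beforehand.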
Re^2: Portable length() in bytes. by DrHyde (Prior) on Nov 08, 2004 at 10:13 UTC
That's almost as stupid as asking why would you want a pointer instead of a reference, why would you want an int as opposed to a float, or why you would want a hash instead of two parallel arrays of "keys" and "values".
Re^3: Portable length() in bytes. by ysth (Canon) on Nov 08, 2004 at 18:11 UTC
Almost but not quite. Thanks for contributing to the conversation.
Re^4: Portable length() in bytes. by DrHyde (Prior) on Nov 10, 2004 at 09:10 UTC