in reply to Size of scalar in bytes

This node falls below the community's minimum standard of quality and will not be displayed.

Re: Re: Size of scalar in bytes
by hardburn (Abbot) on Nov 17, 2003 at 17:45 UTC

    That gives the number of characters in $scalar, which may or may not equal the number of bytes, depending on the encoding (ASCII? UTF-8? UTF-16?) and probably a bunch of other things I don't even know about. To get the number of bytes (as the OP asked), you need to put use bytes; in effect before calling length.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      I'm curious: is there a way to get the number of bytes per character for the encoding in use? If so, I imagine you could do:

      my $len = length($scalar) * $current_encoding_byte_per_char;

      Is that plausible?

        In common cases, yes. For example, you can always be sure that ASCII takes 8 bits per character (well, 7 really, but nobody stores it like that in practice). It gets harder with unusual encodings like RAD-50, which packs three characters into a 16-bit word, so each character takes 5 and a third bits (yup, a non-integer number of bits).

        Once you start thinking in terms of Unicode, you should basically give up trying to figure out how many bytes a given character takes. Even UTF-8 is a variable-length encoding: in standard UTF-8 a single character takes anywhere from one to four bytes. So unless you're working on the dark internals of Unicode handling, just use bytes; (which you should probably have done even if you weren't using Unicode).
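        To make the variable-width point concrete, here is a minimal sketch that encodes four one-character strings to UTF-8 and counts the resulting bytes. The helper name utf8_byte_count and the sample characters are my own choices for illustration; the only library call is encode from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes the string occupies once encoded as UTF-8.
sub utf8_byte_count {
    my ($string) = @_;
    return length(encode('UTF-8', $string));
}

# Each sample is exactly one character long.
my @samples = ("A", "\x{E9}", "\x{20AC}", "\x{1F600}");

for my $char (@samples) {
    printf "U+%04X: %d character, %d byte(s) in UTF-8\n",
        ord($char), length($char), utf8_byte_count($char);
}
```

        The four samples come out at 1, 2, 3 and 4 bytes respectively, so no single multiplier turns a character count into a byte count for UTF-8.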

        If you're interested, see http://www.sidhe.org/~dan/blog/archives/000255.html.

        is there a way to get the number of bytes per character for the encoding in use?

        Some encodings use a fixed bytes/character ratio, but some like UTF-8 do not.

        As hardburn pointed out, if use bytes is in effect then length() returns the length in bytes rather than characters:

        # use bytes is lexically scoped, so it only affects length() inside this sub.
        sub byte_length { use bytes; return length $_[0]; }
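        A quick way to see the difference is to run byte_length on a string that holds a single non-ASCII character. The sketch below is untested against every Perl version; utf8_encoded_length is a hypothetical helper I added for comparison, built on encode from the core Encode module. Note that use bytes reports the size of Perl's internal representation, which is not necessarily the size in whatever encoding you eventually write out, so encoding explicitly may be the safer choice when you need the size in a specific encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The helper from the post above: counts bytes in Perl's internal representation.
sub byte_length { use bytes; return length $_[0]; }

# Hypothetical alternative: count the bytes of an explicit UTF-8 encoding.
sub utf8_encoded_length { return length(encode('UTF-8', $_[0])); }

my $smiley = "\x{263A}";    # one character, U+263A WHITE SMILING FACE

printf "characters:          %d\n", length($smiley);
printf "byte_length:         %d\n", byte_length($smiley);
printf "utf8_encoded_length: %d\n", utf8_encoded_length($smiley);
```

        For this string, length() reports 1 character while both byte counts report 3.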