in reply to Best Way to Get Length of UTF-8 String in Bytes?

Yes. To find the number of bytes text would take encoded as UTF-8, encode it using UTF-8, then use length.
use charnames qw( :full );
use feature qw( say );
use Encode qw( encode_utf8 );

my $text = "\N{LATIN SMALL LETTER E WITH ACUTE}";
say length $text;   # 1

my $utf8 = encode_utf8($text);
say length $utf8;   # 2
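
The same idea can be wrapped in a small helper. This is only a minimal sketch: the name utf8_byte_length is hypothetical (it is not part of Encode), and the Chinese string is written with \x{...} escapes so the encoding of the source file itself does not matter.

```perl
use strict;
use warnings;
use feature qw( say );
use Encode qw( encode_utf8 );

# Hypothetical helper: number of bytes $text occupies once encoded as UTF-8.
# Expects a string of decoded characters, not already-encoded bytes.
sub utf8_byte_length {
    my ($text) = @_;
    return length(encode_utf8($text));
}

say utf8_byte_length("\x{E9}");             # 2 (U+00E9)
say utf8_byte_length("\x{4E2D}\x{56FD}");   # 6 (U+4E2D U+56FD, three bytes each)
say utf8_byte_length("abc");                # 3 (ASCII is one byte per character)
```

Passing a string that has already been encoded would encode it a second time and overcount, so the helper should only be given decoded text.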

Re^2: Best Way to Get Length of UTF-8 String in Bytes?
by Jim (Curate) on Apr 24, 2011 at 01:41 UTC

    Thank you, ikegami.

    Here's what I had tried before posting my inquiry:

    #!perl
    
    use strict;
    use warnings;
    use open qw( :utf8 :std );
    use utf8;
    
    # 'China' in Simplified Chinese
    #          中        国
    # Unicode  U+4E2D    U+56FD
    # UTF-8    E4 B8 AD  E5 9B BD
    
    my $text = '中国';
    my $length_in_characters = length $text;
    print "Length of text '$text' in characters is $length_in_characters\n";
    
    {
        use bytes;
        my $length_in_bytes = length $text;
        print "Length of text '$text' in bytes is $length_in_bytes\n";
    }
    
    {
        require Encode;
        my $bytes = Encode::encode_utf8($text);
        my $length_in_bytes = length $bytes;
        print "Length of text '$bytes' in bytes is $length_in_bytes\n";
    }
    

    And here's its output:

    Length of text '中国' in characters is 2
    Length of text '中国' in bytes is 6
    Length of text '中国' in bytes is 6
    

    (I couldn't use <code> tags here due to the Chinese characters in both the script and its output.)

    Jim

      Are you trying to suggest you could use bytes? That would be incorrect. bytes does not give UTF-8; it gives the internal storage format of the string. That may be utf8 (similar to UTF-8) or just bytes. Here's an example of it giving the incorrect answer:

      #!perl

      use strict;
      use warnings;
      use open qw( :encoding(cp437) :std );
      use utf8;

      my $text = chr(0xC9);
      my $length_in_characters = length $text;
      print "Length of text '$text' in characters is $length_in_characters\n";

      {
          use bytes;
          my $length_in_bytes = length $text;
          print "Length of text '$text' in bytes is $length_in_bytes\n";
      }

      {
          require Encode;
          my $bytes = Encode::encode_utf8($text);
          my $length_in_bytes = length $bytes;
          print "Length of text '$bytes' in bytes is $length_in_bytes\n";
      }

      Length of text 'É' in characters is 1
      Length of text 'É' in bytes is 1
      "\x{00c3}" does not map to cp437 at a.pl line 22.
      "\x{0089}" does not map to cp437 at a.pl line 22.
      Length of text '\x{00c3}\x{0089}' in bytes is 2
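
      The effect can also be isolated without any I/O layers. The following is a sketch that assumes only core Perl: utf8::upgrade changes how a string is stored internally without changing its characters. bytes::length follows the storage, while the length of the encoded string does not. The variable names are illustrative.

      ```perl
      use strict;
      use warnings;
      use feature qw( say );
      use Encode qw( encode_utf8 );
      require bytes;

      my $downgraded = chr(0xE9);     # stored internally as a single byte
      my $upgraded   = chr(0xE9);
      utf8::upgrade($upgraded);       # same character, now stored as utf8

      # The two strings are the same text as far as Perl is concerned.
      say $downgraded eq $upgraded ? "equal" : "different";   # equal

      # bytes::length reports the size of the internal storage.
      say bytes::length($downgraded);   # 1
      say bytes::length($upgraded);     # 2

      # Encoding first gives the UTF-8 size regardless of storage.
      say length(encode_utf8($downgraded));   # 2
      say length(encode_utf8($upgraded));     # 2
      ```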
        I don’t know what all that Microsoft noise was for — nor the use utf8 either for that matter — but we’re all perfectly familiar with “the Unicode bug” thank you very much.

        And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

        % perl -CS -E 'say chr(0xe9)' | perl -CS -nE 'require bytes; say bytes::length($_); chomp; say bytes::length($_)'
        3
        2
        % perl -E '$x = "\x{e9}\x{3b1}"; require bytes; say bytes::length($x); chop $x; say bytes::length($x)'
        4
        2
        % perl -E '$x = "\N{U+E9}"; require bytes; say bytes::length($x)'
        2
        As you can plainly see, it’s only your own isolated little byte constants that can switch internal representation. All you have to do is ever once have a code point greater than 255 anywhere in the string and it stops being a byte string. You also won’t have a problem if you’ve read in the utf8 from something whose encoding layer is set to utf8. So if he has either of those in his program — which it looks like he does — he can ignore Chicken Little.

        It won’t bother him. I’ll bet.
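
        To check which case a particular string falls into, a sketch along these lines may help. It uses the core utf8::is_utf8 function, which reports whether a string is currently stored as utf8 internally. The storage helper and the variable names are hypothetical, introduced only for this illustration.

        ```perl
        use strict;
        use warnings;
        use feature qw( say );

        # Hypothetical helper: name the internal storage format of a string.
        sub storage {
            my ($string) = @_;
            return utf8::is_utf8($string) ? "utf8" : "bytes";
        }

        my $narrow = "\x{e9}";           # every code point is below 256
        my $wide   = "\x{e9}\x{3b1}";    # contains U+03B1, so it is stored as utf8

        say storage($narrow);   # bytes
        say storage($wide);     # utf8

        chop $wide;             # removes U+03B1; the storage is not switched back
        say storage($wide);     # utf8
        say $narrow eq $wide ? "equal" : "different";   # equal
        ```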