Re^3: Best Way to Get Length of UTF-8 String in Bytes?

Are you trying to suggest you could use bytes? That would be incorrect. bytes does not give UTF-8; it gives the internal storage format of the string. That may be utf8 (similar to UTF-8) or just bytes. Here's an example of it giving the incorrect answer:

#!perl

use strict;
use warnings;
use open qw( :encoding(cp437) :std );
use utf8;

my $text = chr(0xC9);
my $length_in_characters = length $text;
print "Length of text '$text' in characters is $length_in_characters\n";

{
    use bytes;
    my $length_in_bytes = length $text;
    print "Length of text '$text' in bytes is $length_in_bytes\n";
}

{
    require Encode;
    my $bytes = Encode::encode_utf8($text);
    my $length_in_bytes = length $bytes;
    print "Length of text '$bytes' in bytes is $length_in_bytes\n";
}

Length of text 'É' in characters is 1
Length of text 'É' in bytes is 1
"\x{00c3}" does not map to cp437 at a.pl line 22.
"\x{0089}" does not map to cp437 at a.pl line 22.
Length of text '\x{00c3}\x{0089}' in bytes is 2
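The same point can be shown without a terminal encoding in the way. This is a minimal sketch, not part of the original post: it stores the same one-character string in each of Perl's two internal formats using `utf8::downgrade` and `utf8::upgrade`, then compares `bytes::length` with the length of the explicitly encoded string. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
require bytes;   # loads bytes::length without enabling the pragma

# Same character as in the post above (U+00C9).
my $text = chr(0xC9);

# Force each of Perl's two internal storage formats in turn.
my $downgraded = $text;
utf8::downgrade($downgraded);   # stored as one byte per character
my $upgraded = $text;
utf8::upgrade($upgraded);       # stored as utf8

# The two copies are the same string as far as Perl code is concerned...
my $same = $downgraded eq $upgraded ? 1 : 0;

# ...but bytes::length reports the storage size, so the answers differ.
my $internal_down = bytes::length($downgraded);
my $internal_up   = bytes::length($upgraded);

# Encoding explicitly gives the UTF-8 length whatever the storage format.
my $utf8_down = length Encode::encode_utf8($downgraded);
my $utf8_up   = length Encode::encode_utf8($upgraded);

print "equal: $same\n";
print "bytes::length: downgraded=$internal_down upgraded=$internal_up\n";
print "encode_utf8:   downgraded=$utf8_down upgraded=$utf8_up\n";
```

Only the encoded length is the same for both copies, which is why the explicit encode is the safer way to count UTF-8 bytes.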

Re^4: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 24, 2011 at 05:53 UTC
I don’t know what all that Microsoft noise was for (nor the `use utf8` either, for that matter), but we’re all perfectly familiar with “the Unicode bug”, thank you very much. And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

% perl -CS -E 'say chr(0xe9)' | perl -CS -nE 'require bytes; say bytes::length($_); chomp; say bytes::length($_)'
3
2
% perl -E '$x = "\x{e9}\x{3b1}"; require bytes; say bytes::length($x); chop $x; say bytes::length($x)'
4
2
% perl -E '$x = "\N{U+E9}"; require bytes; say bytes::length($x)'
2

As you can plainly see, it’s only your own isolated little byte constants that can switch internal representation. All it takes is a code point greater than 255 anywhere in the string, even once, and it stops being a byte string. You also won’t have a problem if you’ve read in the utf8 from something whose encoding layer is set to utf8.

So if he has either of those in his program, which it looks like he does, he can ignore Chicken Little. It won’t bother him. I’ll bet.
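For readers who want to check the storage claim directly, here is a small sketch (an illustration added in editing, not tchrist's code) that uses `utf8::is_utf8` to report which internal format each string is in, next to `bytes::length`. The variable names are invented for the example.

```perl
use strict;
use warnings;
require bytes;   # loads bytes::length without enabling the pragma

my $latin = "\x{e9}";          # every code point fits in one byte
my $mixed = "\x{e9}\x{3b1}";   # U+03B1 does not fit in one byte

# utf8::is_utf8 reports the internal storage format, not the content.
my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;
my $mixed_flag = utf8::is_utf8($mixed) ? 1 : 0;

my $latin_internal = bytes::length($latin);
my $mixed_internal = bytes::length($mixed);

# Removing the wide character does not switch the string back to bytes.
chop $mixed;
my $chopped_flag     = utf8::is_utf8($mixed) ? 1 : 0;
my $chopped_internal = bytes::length($mixed);
my $same_text        = $latin eq $mixed ? 1 : 0;

print "latin:   flag=$latin_flag internal=$latin_internal\n";
print "mixed:   flag=$mixed_flag internal=$mixed_internal\n";
print "chopped: flag=$chopped_flag internal=$chopped_internal same=$same_text\n";
```

After the `chop`, both variables hold the same one-character text, yet `bytes::length` still gives different answers for them.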
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 06:00 UTC
> I don’t know what all that Microsoft noise was for

My terminal uses cp437, and the garbage of encoding UTF-8 was there in the OP's output too. It just looks a bit different on my terminal (`'дёе›Ѕ` vs `\x{00c3}\x{0089}`).

> nor the use utf8 either for that matter

Are you suggesting I should have made irrelevant changes to the OP's code?

> And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

What do you mean unlikely? I'd say it's impossible since those characters are above U+00FF. But so what? He's not going to deal with only those two characters.

I don't get it. In one breath, you say he should handle NFD. In the next, you say I should only concern myself with the characters he posted.
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:29 UTC
I would agree: the Perl implementation is documented to use UTF-8 encoding for one of the two options, and 8-bit chars for the other. It is also explained when each occurs and how they are handled during concatenation, with various options. Certainly it is less problematic and more maintainable to not count on any subtle details that might shift the meaning.

Hmm, just what is the 8-bit form? If it's "whatever was read in", it might include characters encoded in multiple bytes, using some other code page.

So, I would be inclined to feel safe treating the internal length in bytes as the UTF-8 length if I read in the string from a file using UTF-8 encoding, or it was a string literal in a program whose source file used utf8.

I think there is also a utility function somewhere to tell you which mode a string is in. In fact, wouldn't the UTF-8 encoder just check that flag first and realize it's a no-op? So using it would be efficient, if you don't mind copying the string.
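A sketch of the explicit approach, for comparison. The utility function for checking the mode is `utf8::is_utf8`, and `utf8::encode` converts a string to UTF-8 bytes in place. The helper `utf8_byte_length` and the in-memory "file" are made up for this example. It assumes the input is a decoded text string, whether it was read through an `:encoding(UTF-8)` layer or built from code points.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): UTF-8 length of a text string.
sub utf8_byte_length {
    my ($string) = @_;       # work on a copy so the caller's string is untouched
    utf8::encode($string);   # replace the characters with their UTF-8 bytes
    return length $string;
}

# Simulate reading a line through a decoding layer, using an in-memory file.
my $file_bytes = "\xC3\x89t\xC3\xA9\n";   # "Été" plus a newline, as UTF-8 bytes
open my $fh, '<:encoding(UTF-8)', \$file_bytes or die "open failed: $!";
my $from_file = <$fh>;
close $fh;
chomp $from_file;

# The same text built from code points instead of being read from a file.
my $literal = chr(0xC9) . "t" . chr(0xE9);

# Report which internal format each copy happens to be in.
my $file_flag    = utf8::is_utf8($from_file) ? 1 : 0;
my $literal_flag = utf8::is_utf8($literal)   ? 1 : 0;

my $file_len    = utf8_byte_length($from_file);
my $literal_len = utf8_byte_length($literal);

print "from file: is_utf8=$file_flag utf8 bytes=$file_len\n";
print "literal:   is_utf8=$literal_flag utf8 bytes=$literal_len\n";
```

The helper gives the same answer for both strings, so the caller never has to reason about where a string came from or which internal format it is in.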