Re: Best Way to Get Length of UTF-8 String in Bytes?

Re^2: Best Way to Get Length of UTF-8 String in Bytes? by Jim (Curate) on Apr 24, 2011 at 01:41 UTC
Thank you, ikegami. Here's what I had tried before posting my inquiry:

    #!perl
    use strict;
    use warnings;
    use open qw( :utf8 :std );
    use utf8;

    # 'China' in Simplified Chinese
    #     中国
    #     Unicode U+4E2D U+56FD
    #     UTF-8   E4 B8 AD E5 9B BD
    my $text = '中国';

    my $length_in_characters = length $text;
    print "Length of text '$text' in characters is $length_in_characters\n";

    {
        use bytes;
        my $length_in_bytes = length $text;
        print "Length of text '$text' in bytes is $length_in_bytes\n";
    }

    {
        require Encode;
        my $bytes = Encode::encode_utf8($text);
        my $length_in_bytes = length $bytes;
        print "Length of text '$bytes' in bytes is $length_in_bytes\n";
    }

And here's its output:

    Length of text '中国' in characters is 2
    Length of text 'ä¸å›½' in bytes is 6
    Length of text 'ä¸å›½' in bytes is 6

(I couldn't use <code> tags here due to the Chinese characters in both the script and its output.)

Jim
Re^3: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 03:19 UTC
Are you trying to suggest you could use bytes? That would be incorrect. bytes does not give UTF-8; it gives the internal storage format of the string. That may be utf8 (similar to UTF-8) or just bytes. Here's an example of it giving the incorrect answer:

    #!perl

    use strict;
    use warnings;
    use open qw( :encoding(cp437) :std );
    use utf8;

    my $text = chr(0xC9);
    my $length_in_characters = length $text;
    print "Length of text '$text' in characters is $length_in_characters\n";

    {
        use bytes;
        my $length_in_bytes = length $text;
        print "Length of text '$text' in bytes is $length_in_bytes\n";
    }

    {
        require Encode;
        my $bytes = Encode::encode_utf8($text);
        my $length_in_bytes = length $bytes;
        print "Length of text '$bytes' in bytes is $length_in_bytes\n";
    }

Output:

    Length of text 'É' in characters is 1
    Length of text 'É' in bytes is 1
    "\x{00c3}" does not map to cp437 at a.pl line 22.
    "\x{0089}" does not map to cp437 at a.pl line 22.
    Length of text '\x{00c3}\x{0089}' in bytes is 2
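To make the distinction concrete, here is a minimal sketch that encodes first and measures second. The helper name `utf8_byte_length` is hypothetical and not from the thread; `Encode::encode_utf8` and `bytes::length` are the same calls used in the scripts above.

```perl
use strict;
use warnings;
use Encode ();
require bytes;

# Hypothetical helper (name is an assumption): the number of bytes the
# string occupies once it has been encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    return length Encode::encode_utf8($text);
}

my $latin = chr(0xC9);            # one character, all code points <= 255
my $china = "\x{4E2D}\x{56FD}";   # two characters above U+00FF

# bytes::length reports the internal storage size, which only sometimes
# matches the UTF-8 encoded size.
printf "U+00C9: %d UTF-8 bytes, bytes::length says %d\n",
    utf8_byte_length($latin), bytes::length($latin);
printf "China:  %d UTF-8 bytes, bytes::length says %d\n",
    utf8_byte_length($china), bytes::length($china);
```

For `chr(0xC9)` the two numbers differ (2 versus 1), matching the output above; for the Chinese text they agree at 6, which is why the original test could not expose the problem.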
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 24, 2011 at 05:53 UTC
I don’t know what all that Microsoft noise was for, nor the `use utf8` either for that matter, but we’re all perfectly familiar with “the Unicode bug”, thank you very much. And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

    % perl -CS -E 'say chr(0xe9)' | perl -CS -nE 'require bytes; say bytes::length($_); chomp; say bytes::length($_)'
    3
    2
    % perl -E '$x = "\x{e9}\x{3b1}"; require bytes; say bytes::length($x); chop $x; say bytes::length($x)'
    4
    2
    % perl -E '$x = "\N{U+E9}"; require bytes; say bytes::length($x)'
    2

As you can plainly see, it’s only your own isolated little byte constants that can switch internal representation. All it takes is a single code point greater than 255 anywhere in the string and it stops being a byte string. You also won’t have a problem if you’ve read in the utf8 from something whose encoding layer is set to utf8. So if he has either of those in his program (which it looks like he does), he can ignore Chicken Little. It won’t bother him. I’ll bet.
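The switch of internal representation described here can be observed directly. The following is only an illustrative sketch, using the built-in `utf8::upgrade` and `utf8::is_utf8` functions; the variable names are made up for the example.

```perl
use strict;
use warnings;
require bytes;

# A string whose code points are all <= 255 starts out stored
# one byte per character.
my $s = "\x{E9}";
my $before = bytes::length($s);   # internal size in the byte form

# Same characters, different internal storage.
utf8::upgrade($s);
my $after = bytes::length($s);    # internal size once stored as utf8

# A string containing a code point above 255 is always stored as utf8.
my $wide    = "\x{E9}\x{3B1}";
my $flagged = utf8::is_utf8($wide) ? 1 : 0;

print "length is still ", length($s),
      "; bytes::length went from $before to $after\n";
print "wide string stored as utf8: $flagged\n";
```

The character length stays at 1 throughout, while `bytes::length` goes from 1 to 2 for the same string. That is the sense in which it measures storage rather than the UTF-8 encoding.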
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 06:00 UTC
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:29 UTC
