in reply to Re^3: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

I've studied the demonstration script and I understand everything it's doing, except for this bit:

my $MAX_BYTES = 25;
my ($MIN_BPC, $MAX_BPC) = (1, 4);
my $MAX_CHARS = $MAX_BYTES / $MIN_BPC;

What's going on here? $MAX_CHARS will always be set to the value of $MAX_BYTES, and $MAX_BPC seems to serve no function. Am I right?

Also, what happens if, in the initial truncation of the string done using substr() as an lvalue, we land smack dab in the middle of a grapheme, and the rightmost character in the resultant truncated string is, by itself, a valid grapheme?

D:\>perl -CO -Mcharnames=:full -wE "$MAX = 4; $cafe = qq/cafe\N{COMBINING ACUTE ACCENT}/; say $cafe; substr($cafe, $MAX) = ''; say $cafe;" > cafe.txt
D:\>

Here's the text in the output file cafe.txt:

café
cafe

(Thanks again for this very helpful script!)

Re^5: Best Way to Get Length of UTF-8 String in Bytes?
by tchrist (Pilgrim) on Apr 27, 2011 at 05:16 UTC
    Jim, you’re right about both points. I set up the MAX constants when I was going to do something different, and then never went back and cleaned up after myself.

    The substr truncation in the middle of a grapheme cluster is really ugly. That is what I had been trying so hard to avoid with the whole s/\X$// while too long thing. And you can’t guess how many add-ons there are. There are some standards that allow you to buffer only enough for ten, but those are not really relevant to general work.

    I’m afraid you may have to do something like this instead:

    # either this way:
    $s =~ s/^\X{0,$MAX_CHARS}\K.*//s;

    # or "by hand", this way:
    substr($s, pos $s) = "" if $s =~ /^\X{0,$MAX_CHARS}/g;
    And then do the backwards peeling-off of graphemes until the byte length is small enough. I wouldn’t count on the second form being faster; measure it if it matters.
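To make the two steps concrete, here is a minimal sketch that puts them together. The helper name truncate_to_bytes is hypothetical (it is not from the demonstration script), and it assumes Perl 5.10 or later for \K and a byte budget small enough to fit in a {0,N} quantifier, which older perls cap at roughly 32K.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of the original script: trim $s so that its
# UTF-8 encoding fits in $max_bytes without cutting inside a grapheme cluster.
sub truncate_to_bytes {
    my ($s, $max_bytes) = @_;

    # Step 1: every grapheme needs at least one byte, so no more than
    # $max_bytes graphemes can possibly fit. Drop everything after that.
    $s =~ s/^\X{0,$max_bytes}\K.*//s;

    # Step 2: peel whole graphemes off the end until the encoded
    # length is within budget.
    while (length(encode('UTF-8', $s)) > $max_bytes) {
        $s =~ s/\X\z// or last;    # nothing left to remove
    }
    return $s;
}
```

With a budget of 5 bytes, "cafe\x{301}" (6 bytes in UTF-8) comes back as "caf": the final e plus its combining accent is removed as one unit rather than leaving a bare "cafe" behind.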

    That’s just off the top of my head right now, which seeing as it’s way past my bedtime, might be pretty off. Hope this is any help at all.

    I keep resisting the urge to break down and do it in C instead. Identifying an extended grapheme cluster by hand is not my idea of a good time. Look at the code to do it in regexec.c from 5.12 or later, the version with all the LVT business. It’s the part that starts at line 3768 right now in the current source tree, right at case CLUMP, and run through line 3979 for the next case. I think you’ll see why I didn’t want to recreate all that business.

    3768         case CLUMP: /* Match \X: logical Unicode character.  This is defined as
    3769                        a Unicode extended Grapheme Cluster */
    3770             /* From http://www.unicode.org/reports/tr29 (5.2 version).  An
    3771               extended Grapheme Cluster is:
    3772
    3773                CR LF
    3774                | Prepend* Begin Extend*
    3775                | .
    3776
    3777                Begin is (Hangul-syllable | ! Control)
    3778                Extend is (Grapheme_Extend | Spacing_Mark)
    3779                Control is [ GCB_Control CR LF ]
    3780
    3781                The discussion below shows how the code for CLUMP is derived
    3782                from this regex.  Note that most of these concepts are from
    3783                property values of the Grapheme Cluster Boundary (GCB) property.

    And then it goes on for a couple hundred more lines of delight.

      There are some standards that allow you to only buffer enough for ten, but those are not really relevant to general work.

      Graphemes can extend to over 10 code points? wow! Is there any limit?

        Graphemes can extend to over 10 code points? wow! Is there any limit?
        No, not in general. ☃ I can’t recall where I read that some standard suggested leaving enough room to buffer 10 of them. I’d thought it was RFC 3454, but it does not appear to be.

        Consider this example. If you paste that into a Mac Terminal window, you will see stacking a bit like this. That string has up to 21 \pM code points following a \PM one. Placed one \X grapheme per numbered line, that’s this:

        1 Z\x{335}\x{32C}\x{32E}\x{32B}\x{31E}\x{32A}\x{356}\x{354}\x{316}\x{369}\x{350}\x{313}\x{34C}\x{313}\x{30D}\x{36C}
        2 A\x{31D}\x{33A}\x{317}\x{31C}\x{32F}\x{331}\x{363}\x{350}\x{346}\x{301}
        3 L\x{337}\x{32D}\x{35A}\x{33C}\x{320}\x{318}\x{34E}\x{308}\x{300}\x{302}\x{30B}\x{345}
        4 G\x{329}\x{35A}\x{332}\x{355}\x{302}\x{36F}\x{31A}\x{362}\x{361}
        5 O\x{337}\x{31C}\x{33C}\x{332}\x{33B}\x{300}\x{311}\x{358}\x{362}
        6 \x{32C}\x{324}\x{32D}\x{317}\x{332}\x{317}\x{33A}\x{36B}\x{34B}\x{35D}
        7 W\x{335}\x{348}\x{354}\x{34E}\x{31E}\x{320}\x{332}\x{32F}\x{34C}\x{30C}\x{33E}\x{301}
        8 I\x{354}\x{353}\x{36A}\x{365}\x{306}\x{369}\x{36C}\x{300}\x{36B}\x{351}\x{310}\x{358}\x{360}
        9 L\x{323}\x{324}\x{349}\x{313}\x{350}\x{350}\x{30E}\x{312}\x{301}
        10 L\x{316}\x{31F}\x{364}\x{30C}\x{305}\x{36C}\x{302}\x{300}\x{315}
        11 \x{31F}\x{325}\x{323}\x{33C}\x{325}\x{352}\x{367}\x{36E}\x{36C}\x{352}\x{30B}\x{365}\x{352}\x{35D}
        12 C\x{31C}\x{325}\x{331}\x{318}\x{317}\x{353}\x{34D}\x{354}\x{313}\x{312}\x{36D}\x{350}\x{307}\x{35C}
        13 O\x{335}\x{319}\x{330}\x{316}\x{353}\x{350}\x{307}\x{311}\x{30B}\x{369}\x{304}\x{361}
        14 M\x{34A}\x{369}\x{346}\x{34A}\x{34F}\x{330}\x{32B}\x{32D}\x{333}\x{33C}\x{320}\x{34E}\x{361}\x{35D}
        15 E\x{33F}\x{346}\x{34F}\x{31E}\x{35E}
        16 \x{327}\x{31B}\x{330}\x{35A}\x{31C}\x{318}\x{31E}\x{32B}\x{339}\x{349}\x{308}\x{301}\x{33D}\x{352}\x{352}\x{302}\x{350}\x{36E}\x{34B}\x{30D}\x{358}
        17 \x{327}\x{31E}\x{319}\x{317}\x{32A}\x{333}\x{30D}\x{310}\x{300}\x{30C}\x{368}\x{35C}
        18 D\x{317}\x{33C}\x{33E}\x{33E}
        19 O\x{318}\x{320}\x{313}\x{36F}\x{300}\x{36C}
        20 \x{31B}\x{33C}\x{329}\x{32E}\x{354}\x{363}\x{30A}\x{352}\x{33F}\x{30F}\x{36D}\x{306}
        21 N\x{347}\x{325}\x{324}\x{32F}\x{32E}\x{323}\x{316}\x{316}\x{363}\x{313}\x{36B}\x{368}
        22 O\x{334}\x{322}\x{32F}\x{33C}\x{355}\x{32E}\x{33A}\x{330}\x{317}\x{302}\x{313}\x{364}\x{36F}\x{345}
        23 T\x{318}\x{308}\x{301}\x{30E}\x{350}\x{303}
        24 \x{338}\x{328}\x{349}\x{329}\x{31F}\x{307}\x{342}\x{357}\x{362}
        25 A\x{349}\x{31C}\x{31D}\x{325}\x{31D}\x{34E}\x{349}\x{317}\x{300}\x{314}\x{366}\x{30A}\x{313}\x{350}\x{35F}
        26 N\x{32B}\x{359}\x{359}\x{319}\x{347}\x{32F}\x{359}\x{355}\x{30F}\x{34C}\x{365}\x{364}\x{365}\x{366}\x{350}\x{352}\x{303}\x{315}
        27 G\x{323}\x{35A}\x{316}\x{35A}\x{331}\x{34E}\x{363}\x{313}\x{35B}\x{30E}\x{36D}\x{304}\x{301}
        28 E\x{339}\x{31D}\x{323}\x{35A}\x{339}\x{317}\x{33B}\x{349}\x{36A}
        29 R\x{34D}\x{317}\x{359}\x{339}\x{351}\x{313}\x{35F}
        30 \x{32B}\x{325}\x{324}\x{316}\x{333}\x{33C}\x{355}\x{32F}\x{36A}\x{36E}\x{30C}\x{36E}\x{366}\x{361}
        31 Z\x{369}\x{34A}\x{350}\x{30D}\x{301}\x{30B}\x{34F}\x{35A}\x{32C}\x{326}\x{355}\x{32B}\x{319}\x{329}
        32 A\x{337}\x{355}\x{317}\x{318}\x{32C}\x{34D}\x{35B}\x{30D}\x{36D}\x{363}\x{36B}\x{36D}\x{31A}\x{362}\x{345}
        33 L\x{322}\x{316}\x{320}\x{32B}\x{30D}\x{366}\x{342}\x{302}\x{36D}\x{304}\x{35B}
        34 G\x{316}\x{319}\x{359}\x{353}\x{33A}\x{368}\x{306}\x{36C}
        35 O\x{335}\x{339}\x{32B}\x{35A}\x{349}\x{348}\x{323}\x{356}\x{314}\x{357}\x{36E}

        Which is really quite remarkable, isn’t it? Each line is just one grapheme long. The longest, line number 16, contains 22 code points, which when encoded as UTF‑8, requires 43 bytes of storage. All told, that particular string has 8 words, 35 graphemes (user‐visible characters), 434 code points (programmer‐visible characters), and in UTF‑8 occupies 833 bytes (filesystem‐visible characters).
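For anyone who wants to reproduce that kind of tally, here is a rough sketch of how the three counts can be taken in Perl. The function name unicode_lengths is made up for illustration, and the string is assumed to be already decoded into characters rather than raw bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return (graphemes, code points, UTF-8 bytes)
# for an already-decoded character string.
sub unicode_lengths {
    my ($s) = @_;
    my $graphemes   = () = $s =~ /\X/g;            # count \X matches
    my $code_points = length $s;                   # characters as Perl sees them
    my $bytes       = length encode('UTF-8', $s);  # size once written out as UTF-8
    return ($graphemes, $code_points, $bytes);
}

# Numbered line 18 from the listing: D followed by four combining marks.
my @counts = unicode_lengths("D\x{317}\x{33C}\x{33E}\x{33E}");
print "@counts\n";    # prints "1 5 9"
```

One grapheme, five code points, nine bytes: the D takes one byte and each of the four combining marks takes two.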

        Which shows why we avoid the word “characters” when talking about Unicode. :)