The substr truncation in the middle of a grapheme cluster is really ugly. That is what I had been trying so hard to avoid with the whole s/\X$// while too long thing. And you can’t guess how many add-ons there will be. There are some standards that allow you to only buffer enough for ten, but those are not really relevant to general work.
I’m afraid you may have to do something like this instead:

    # either this way:
    $s =~ s/^\X{0,$MAX_CHARS}\K.*//s;

    # or "by hand", this way:
    substr($s, pos $s) = "" if $s =~ /^\X{0,$MAX_CHARS}/g;

And then do the backwards peeling-off of graphemes until the byte length is small enough. I wouldn’t count on the second form being faster; measure it if it matters.
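Putting the two steps together, here is a minimal sketch of what I mean. The name trim_to_limits and its $max_bytes parameter are made up for illustration, and I am assuming the byte length that matters is the UTF-8 encoded length as reported via Encode's encode_utf8. It needs 5.10 or later for \K. I used \z rather than $ in the peeling step so that a trailing newline gets treated as a grapheme of its own instead of being skipped over.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: trim $s to at most $max_chars grapheme clusters
# and at most $max_bytes bytes of UTF-8, without ever cutting a cluster
# in half.
sub trim_to_limits {
    my ($s, $max_chars, $max_bytes) = @_;

    # Step 1: keep at most $max_chars extended grapheme clusters,
    # throwing away everything after them.
    $s =~ s/^\X{0,$max_chars}\K.*//s;

    # Step 2: peel whole graphemes off the end until the encoded form
    # fits in the byte budget. The "or last" guards against looping
    # forever if nothing is left to remove.
    while (length(encode_utf8($s)) > $max_bytes) {
        $s =~ s/\X\z// or last;
    }

    return $s;
}
```

You would call it as something like trim_to_limits($s, $MAX_CHARS, $MAX_BYTES), where $MAX_BYTES is again an assumed name for whatever your byte limit is. Re-encoding on every pass through the loop is wasteful but easy to reason about; if it shows up in profiling, track the byte count of each removed grapheme instead.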
That’s just off the top of my head right now, which, seeing as it’s way past my bedtime, might be pretty off. Hope this is of some help.
I keep resisting the urge to break down and do it in C instead. Identifying an extended grapheme cluster by hand is not my idea of a good time. Look at the code to do it in regexec.c from 5.12 or later, the version with all the LVT business. It’s the part that starts at line 3768 right now in the current source tree, right at case CLUMP, and runs through line 3979 for the next case. I think you’ll see why I didn’t want to recreate all that business.
    3768 case CLUMP: /* Match \X: logical Unicode character.  This is defined as
    3769                a Unicode extended Grapheme Cluster */
    3770     /* From http://www.unicode.org/reports/tr29 (5.2 version).  An
    3771        extended Grapheme Cluster is:
    3772
    3773        CR LF
    3774        | Prepend* Begin Extend*
    3775        | .
    3776
    3777        Begin is (Hangul-syllable | ! Control)
    3778        Extend is (Grapheme_Extend | Spacing_Mark)
    3779        Control is [ GCB_Control CR LF ]
    3780
    3781        The discussion below shows how the code for CLUMP is derived
    3782        from this regex.  Note that most of these concepts are from
    3783        property values of the Grapheme Cluster Boundary (GCB) property.
And then it goes on for a couple hundred more lines of delight.
In reply to Re^5: Best Way to Get Length of UTF-8 String in Bytes?
by tchrist
in thread Best Way to Get Length of UTF-8 String in Bytes?
by Jim