Jim, you’re right about both points. The MAX constants are left over from when I was going to do something different; I never went back and cleaned up after myself.

Truncating with substr in the middle of a grapheme cluster is really ugly. That is what I had been trying so hard to avoid with the whole “s/\X$// while too long” thing. And you can’t guess how many add-ons (combining marks and the like) follow a base character. There are some standards that let you buffer only enough for ten, but those are not really relevant to general work.
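To make the problem concrete, here is a small sketch (the sample string and variable names are mine, purely for illustration) that shows substr counting code points while \X counts grapheme clusters, so a substr cut can strand a base letter without its accent:

```perl
use strict;
use warnings;

# "café!" spelled with e + U+0301 COMBINING ACUTE ACCENT (illustrative sample).
my $s = "cafe\x{301}!";

my $code_points = length $s;          # what length and substr count
my $graphemes   = () = $s =~ /\X/g;   # what a reader perceives as characters

# substr works on code points, so this cut lands between "e" and its accent.
my $cut = substr($s, 0, 4);

# Matching up to four whole clusters keeps the accent attached to its base.
my ($safe) = $s =~ /^(\X{0,4})/;

printf "%d code points, %d graphemes\n", $code_points, $graphemes;
printf "substr kept %d code points, \\X kept %d\n", length($cut), length($safe);
```

Here the substr result is a bare "cafe", while the \X match keeps all five code points of "café".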

I’m afraid you may have to do something like this instead:

    # either this way:
    $s =~ s/^\X{0,$MAX_CHARS}\K.*//s;

    # or "by hand", this way:
    substr($s, pos $s) = "" if $s =~ /^\X{0,$MAX_CHARS}/g;
And then do the backwards peeling-off of graphemes until the byte length is small enough. I wouldn’t count on the second form being faster; measure it if it matters.
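Put together, the two steps might look like the sketch below. The names trim_to_limits and utf8_length, and the limits passed in, are hypothetical; I am assuming perl 5.10 or later for \K, and I anchor with \z rather than $ so that a trailing newline is peeled off like any other cluster. Treat it as untested-in-anger illustration, not a finished routine.

```perl
use strict;
use warnings;

# Hypothetical helper: byte length of a string's UTF-8 encoding.
sub utf8_length {
    my ($s) = @_;
    utf8::encode($s);    # encodes the local copy in place
    return length $s;
}

# Hypothetical helper: keep at most $max_chars grapheme clusters, then
# drop whole clusters from the end until the UTF-8 form fits $max_bytes.
sub trim_to_limits {
    my ($s, $max_chars, $max_bytes) = @_;

    # Step 1: cap the number of grapheme clusters.
    $s =~ s/^\X{0,$max_chars}\K.*//s;

    # Step 2: peel clusters off the end while the encoding is too long.
    # The empty string has byte length 0, so the loop always terminates.
    while (utf8_length($s) > $max_bytes) {
        $s =~ s/\X\z//;
    }
    return $s;
}

# Five copies of e + U+0301: 3 bytes per cluster in UTF-8.
my $t = trim_to_limits("e\x{301}" x 5, 5, 7);
printf "%d code points, %d bytes\n", length($t), utf8_length($t);
# prints "4 code points, 6 bytes"
```

Re-encoding on every pass is wasteful for long strings; if that matters, you could subtract the byte length of each removed cluster instead, but measure before bothering.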

That’s just off the top of my head, which, seeing as it’s way past my bedtime, might be pretty far off. I hope it’s of some help.

I keep resisting the urge to break down and do it in C instead. Identifying an extended grapheme cluster by hand is not my idea of a good time. Look at the code that does it in regexec.c from 5.12 or later, the version with all the LVT business. In the current source tree it starts at line 3768, right at case CLUMP, and runs through line 3979, where the next case begins. I think you’ll see why I didn’t want to recreate all that business.

    3768    case CLUMP: /* Match \X: logical Unicode character.  This is defined as
    3769                   a Unicode extended Grapheme Cluster */
    3770        /* From http://www.unicode.org/reports/tr29 (5.2 version).  An
    3771          extended Grapheme Cluster is:
    3772
    3773           CR LF
    3774           | Prepend* Begin Extend*
    3775           | .
    3776
    3777           Begin is (Hangul-syllable | ! Control)
    3778           Extend is (Grapheme_Extend | Spacing_Mark)
    3779           Control is [ GCB_Control CR LF ]
    3780
    3781           The discussion below shows how the code for CLUMP is derived
    3782           from this regex.  Note that most of these concepts are from
    3783           property values of the Grapheme Cluster Boundary (GCB) property.

And then it goes on for a couple hundred more lines of delight.


In reply to Re^5: Best Way to Get Length of UTF-8 String in Bytes? by tchrist
in thread Best Way to Get Length of UTF-8 String in Bytes? by Jim
