in reply to Re: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

The problem is that it depends on one important thing: Whatever in the world do you want it for, anyway?

To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of.

It’s quite possible that there might be a better approach you just don’t know about.

Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes worth of it into a column?

Thank you for your reply. I appreciate it.

Jim


Re^3: Best Way to Get Length of UTF-8 String in Bytes?
by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:42 UTC
    It is an intentional property of the UTF-8 encoding that, although it is variable-length, you can easily tell when you're in the middle of a character and where whole characters begin. Continuation bytes always start with the bits 10xxxxxx. Single-byte characters always have a high bit of 0 (0xxxxxxx), and multi-byte characters always start with a byte that has as many leading 1 bits as there are bytes in total: 110xxxxx for two bytes, 1110xxxx for three bytes, etc.

    So, start at position N of the utf-8 encoded byte string that is the maximum length. While the byte at position N is a continuation byte, decrement N. Now you can truncate to length N.

    To prevent clipping the accents off a base character or something like that, you can furthermore look at the whole character beginning at N. Check its Unicode properties to see if it's a modifier of some kind. If it is, decrement N again and repeat.
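
    As a minimal sketch, that byte-level scan might look like the following in Perl, assuming $octets is a raw UTF-8 octet string (not a decoded character string) and clip_octets is just an illustrative name; the grapheme check described above would still be a separate, later step.

    # Clip a UTF-8 octet string to at most $max bytes without splitting a
    # multi-byte character.
    sub clip_octets {
        my ($octets, $max) = @_;
        return $octets if length($octets) <= $max;
        my $n = $max;
        # Back up while the byte at position $n is a 10xxxxxx continuation
        # byte, so the cut never lands inside a character.
        $n-- while $n > 0 && (ord(substr($octets, $n, 1)) & 0xC0) == 0x80;
        return substr($octets, 0, $n);
    }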

Re^3: Best Way to Get Length of UTF-8 String in Bytes?
by tchrist (Pilgrim) on Apr 24, 2011 at 04:06 UTC
    To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of.
    I feel lucky that all my database experiences in recent memory have involved ones that had no fixed limits on any of their sizes. One still had to encode/decode to UTF-8, but I didn’t have your particular problem.
    It’s quite possible that there might be a better approach you just don’t know about.

    Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes worth of it into a column?

    Well, Jim, that’s quite a pickle. I think I’m going to renege on my advice in perlfunc. If you encode to UTF-8 bytes, then you won’t know whether, and most especially where, to truncate your string, because you’ve lost the character information. And you really do need character information here, plus more besides.

    It is vaguely possible that you might be able to arrange something with the \C regex escape for an octet, somehow combining a bytewise assertion of \A\C{0,32767} with one that fits a charwise \A.* or better yet a grapheme-wise ^\X* within that.

    But that isn’t the approach I finally ended up using, because it sounded too complicated and messy. I decided to do something really simple instead. My “Pac-Man®” algorithm: chop until short enough. More specifically, remove the last grapheme until the string fits within your maximum byte count.

    You can do a bit better than blind naïveté by noticing that, even at the maximum efficiency of one byte per character (pure ASCII), a string whose character count exceeds the maximum allowed byte count can never fit, so you can pre-truncate it to that many characters first. That way you don’t come slowly pac-manning back from 100k-character strings.

    There are a few things complicating your life. Just as you do not wish to chop off a byte in the middle of a character, neither do you want to chop off a character in the middle of a grapheme. You don’t want "\cM\cJ" to get split if that’s in your data, and you very most especially do not wish to lose a Grapheme_Extend code point like a diacritic or an underline/overline off of its Grapheme_Base code point.

    What I ended up doing, therefore, was this:

    require bytes;
    if (length($s) > $MAX) {
        substr($s, $MAX) = "";
    }
    $s =~ s/\X\z// until $MAX > bytes::length($s);
    That assumes that the strings are Unicode strings with their UTF-8 flags on. I have done the wickedness of using the bytes namespace, which breaks the encapsulation of abstract characters: I am relying on knowing that the internal representation is UTF-8, so that its internal byte length is the UTF-8 byte length. If that ever changes (and there is no guarantee at all that it will not do so someday), then this code will break.

    Also, it is critical that it be required, not used. You do not want byte semantics for your operations; you just want to be able to get a bytes::length on its own.
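
    For instance, with a string whose internal representation is UTF-8, the difference looks like this:

    require bytes;                         # require, not use: no byte semantics imposed
    my $s = "smile \x{263A}";              # seven characters; the last is U+263A
    print length($s),        "\n";         # 7 -- ordinary length() still counts characters
    print bytes::length($s), "\n";         # 9 -- bytes::length() counts the internal UTF-8 octets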

    I haven’t benchmarked this against doing it the “pure” way with a bunch of calls to encode_utf8. You might want to do that.
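
    That pure version would just take the byte count from an actual encode, something like this sketch:

    use Encode qw(encode_utf8);

    # Measure the UTF-8 size by really encoding, instead of peeking at the
    # internal representation through the bytes namespace.
    sub bytelen_pure {
        my ($string) = @_;
        return length( encode_utf8($string) );   # octet count of the encoding
    }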

    But what I have done is run the algorithm against a bunch of strings: some in NFD form, some in NFC; some French and some Chinese; some with multiple combining characters; some even with very fat graphemes from up in the SMP with their marks (math letters with overlines). I ran it with MAX == 25 bytes, but I see no reason why it shouldn’t work just as well when set to your own 32,767. Here are some examples of the traces:

    String was <NFD: tête‐à‐tête tête‐à‐tête>
            start string has graphlen 28, charlen 34, bytelen 48
            CHARLEN 34 > 25, truncating to 25 CHARS
            bytelen 33 still too long, chopping last grapheme
            deleted grapheme <e> U+0065, charlen -1, bytelen -1
            bytelen 32 still too long, chopping last grapheme
            deleted grapheme <t> U+0074, charlen -1, bytelen -1
            bytelen 31 still too long, chopping last grapheme
            deleted grapheme <ê> U+0065.0302, charlen -2, bytelen -3
            bytelen 28 still too long, chopping last grapheme
            deleted grapheme <t> U+0074, charlen -1, bytelen -1
            bytelen 27 still too long, chopping last grapheme
            deleted grapheme < > U+0020, charlen -1, bytelen -1
            bytelen 26 still too long, chopping last grapheme
            deleted grapheme <e> U+0065, charlen -1, bytelen -1
            final string has graphlen 15, charlen 18, bytelen 25
    Trunc'd is <NFD: tête‐à‐têt>
    
    String was <NFD 蓝 lán and 绿 lǜ>
            start string has graphlen 18, charlen 21, bytelen 28
            bytelen 28 still too long, chopping last grapheme
            deleted grapheme <ǜ> U+0075.0308.0300, charlen -3, bytelen -5
            final string has graphlen 17, charlen 18, bytelen 23
    Trunc'd is <NFD 蓝 lán and 绿 l>
    
    String was <Chinese: 青天,白日,满地红>
            start string has graphlen 18, charlen 18, bytelen 36
            bytelen 36 still too long, chopping last grapheme
            deleted grapheme <红> U+7EA2, charlen -1, bytelen -3
            bytelen 33 still too long, chopping last grapheme
            deleted grapheme <地> U+5730, charlen -1, bytelen -3
            bytelen 30 still too long, chopping last grapheme
            deleted grapheme <满> U+6EE1, charlen -1, bytelen -3
            bytelen 27 still too long, chopping last grapheme
            deleted grapheme <,> U+FF0C, charlen -1, bytelen -3
            final string has graphlen 14, charlen 14, bytelen 24
    Trunc'd is <Chinese: 青天,白日>
    
    String was <NFD: hã̂ç̌k hã̂ç̌k hẫç̌k hẫç̌k>
            start string has graphlen 24, charlen 35, bytelen 51
            CHARLEN 35 > 25, truncating to 25 CHARS
            bytelen 37 still too long, chopping last grapheme
            deleted grapheme <ç̌> U+00E7.030C, charlen -2, bytelen -4
            bytelen 33 still too long, chopping last grapheme
            deleted grapheme <ẫ> U+1EAB, charlen -1, bytelen -3
            bytelen 30 still too long, chopping last grapheme
            deleted grapheme <h> U+0068, charlen -1, bytelen -1
            bytelen 29 still too long, chopping last grapheme
            deleted grapheme < > U+0020, charlen -1, bytelen -1
            bytelen 28 still too long, chopping last grapheme
            deleted grapheme <k> U+006B, charlen -1, bytelen -1
            bytelen 27 still too long, chopping last grapheme
            deleted grapheme <ç̌> U+0063.0327.030C, charlen -3, bytelen -5
            final string has graphlen 12, charlen 16, bytelen 22
    Trunc'd is <NFD: hã̂ç̌k hã̂>
    
    String was <𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]>
            start string has graphlen 17, charlen 20, bytelen 34
            bytelen 34 still too long, chopping last grapheme
            deleted grapheme <]> U+005D, charlen -1, bytelen -1
            bytelen 33 still too long, chopping last grapheme
            deleted grapheme <²> U+00B2, charlen -1, bytelen -2
            bytelen 31 still too long, chopping last grapheme
            deleted grapheme <𝐁̅> U+1D401.0305, charlen -2, bytelen -6
            final string has graphlen 14, charlen 16, bytelen 25
    Trunc'd is <𝐂̅ = sqrt[𝐀̅² + >
    Here’s the complete program. I’ve uniquoted the strings, so the program itself is actually in pure ASCII, which means I can put it in <code> tags here instead of messing around with icky <pre> and weird escapes. You can download it easily enough if you want and play with the numbers and all, but the heart of the algorithm is just the one-liner that throws out the last grapheme and checks the byte length. You’ll see that I’ve left the debugging in.
    #!/usr/bin/env perl
    use 5.12.0;
    use strict;
    use autodie;
    use warnings;
    use utf8;
    use open qw<:std :utf8>;
    use charnames qw< :full >;

    require bytes;

    my $MAX_BYTES = 25;
    my ($MIN_BPC, $MAX_BPC) = (1, 4);
    my $MAX_CHARS = $MAX_BYTES / $MIN_BPC;

    sub bytelen(_) {
        require bytes;
        return bytes::length($_[0]);
    }

    sub graphlen(_) {
        my $count = 0;
        $count++ while $_[0] =~ /\X/g;
        return $count;
    }

    sub charlen(_) {
        return length($_[0]);
    }

    sub shorten(_) {
        my $s = $_[0];
        printf "\tstart string has graphlen %d, charlen %d, bytelen %d\n",
            graphlen($s), charlen($s), bytelen($s);
        if (charlen($s) > $MAX_CHARS) {
            printf "\tCHARLEN %d > %d, truncating to %d CHARS\n",
                length($s), $MAX_BYTES, $MAX_CHARS;
            substr($s, $MAX_CHARS) = "";
        }
        while (bytelen($s) > $MAX_BYTES) {
            printf "\tbytelen %d still too long, chopping last grapheme\n",
                bytes::length($s);
            $s =~ s/(\X)\z//;
            printf "\tdeleted grapheme <%s> U+%v04X, charlen -%d, bytelen -%d\n",
                $1, $1, length($1), bytes::length($1);
        }
        printf "\tfinal string has graphlen %d, charlen %d, bytelen %d\n",
            graphlen($s), charlen($s), bytelen($s);
        return $s;
    }

    my @strings = (
        "this lines starts a bit too long",
        "NFC: cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e",
        "NFC: t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te\N{HYPHEN}\N{LATIN SMALL LETTER A WITH GRAVE}\N{HYPHEN}t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te\N{HYPHEN}\N{LATIN SMALL LETTER A WITH GRAVE}\N{HYPHEN}t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te",
        "NFD: cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e",
        "NFD: te\N{COMBINING CIRCUMFLEX ACCENT}te\N{HYPHEN}a\N{COMBINING GRAVE ACCENT}\N{HYPHEN}te\N{COMBINING CIRCUMFLEX ACCENT}te te\N{COMBINING CIRCUMFLEX ACCENT}te\N{HYPHEN}a\N{COMBINING GRAVE ACCENT}\N{HYPHEN}te\N{COMBINING CIRCUMFLEX ACCENT}te",
        "NFC \N{U+84DD} l\N{LATIN SMALL LETTER A WITH ACUTE}n and \N{U+7EFF} l\N{LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE}",
        "NFD \N{U+84DD} la\N{COMBINING ACUTE ACCENT}n and \N{U+7EFF} lu\N{COMBINING DIAERESIS}\N{COMBINING GRAVE ACCENT}",
        "XXX NFC q\N{LATIN SMALL LETTER I WITH MACRON}ng ti\N{LATIN SMALL LETTER A WITH MACRON}n, b\N{LATIN SMALL LETTER A WITH ACUTE}i r\N{LATIN SMALL LETTER I WITH GRAVE}, m\N{LATIN SMALL LETTER A WITH CARON}n d\N{LATIN SMALL LETTER I WITH GRAVE} h\N{LATIN SMALL LETTER O WITH ACUTE}ng",
        "XXX NFD qi\N{COMBINING MACRON}ng tia\N{COMBINING MACRON}n, ba\N{COMBINING ACUTE ACCENT}i ri\N{COMBINING GRAVE ACCENT}, ma\N{COMBINING CARON}n di\N{COMBINING GRAVE ACCENT} ho\N{COMBINING ACUTE ACCENT}ng",
        "Chinese: \N{U+9752}\N{U+5929}\N{FULLWIDTH COMMA}\N{U+767D}\N{U+65E5}\N{FULLWIDTH COMMA}\N{U+6EE1}\N{U+5730}\N{U+7EA2}",
        "normal \N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E} normal \N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}",
        "NFC: h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k",
        "NFD: ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k",
        "\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]",
        "4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}",
    );

    printf "MAX byte length is %d\n\n", $MAX_BYTES;

    for my $line (@strings) {
        chomp $line;
        say "String was <$line>";
        my $trunk = shorten($line);
        say "Trunc'd is <$trunk>\n";
    }

    exit 0;
    There are other ways to go about this, but this seemed to work well enough. Hope it helps.

    Oh, BTW, if you really want to do print-columns instead of graphemes, look to the Unicode::GCString module; it comes with Unicode::LineBreak. Both are highly recommended. I use them in my unifmt program to do intelligent linebreaking of Asian text per UAX#14.
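
    If you go the column-counting route, a sketch with Unicode::GCString might look like this, assuming its documented new(), columns(), length(), substr(), and as_string() methods (clip_to_columns is just an illustrative name):

    use Unicode::GCString;

    # Clip to a print-column budget rather than a byte budget: drop trailing
    # grapheme clusters until the rendered width fits.
    sub clip_to_columns {
        my ($string, $max_columns) = @_;
        my $gcs = Unicode::GCString->new($string);
        $gcs = $gcs->substr(0, $gcs->length - 1)
            while $gcs->length && $gcs->columns > $max_columns;
        return $gcs->as_string;
    }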

      Amazing! Thank you, thank you, thank you, Tom!

      It's a lot for me to assimilate with my limited intellect and meager Perl skills, but I will certainly try.

      My real objective—as awful as it sounds—is to split arbitrarily long UTF-8 strings into chunks of 32,767-byte substrings and distribute them into a series of VARCHAR columns. It's horrible, I know, but if I don't do it, another Ivfhny Onfvp .ARG programmer will—and much more badly than I.

      Jim

      I see use bytes; without any utf8::upgrade or utf8::downgrade, and that usually indicates code that suffers from "The Unicode Bug".

      sub bytelen(_) {
          require bytes;
          return bytes::length($_[0]);
      }

      should be

      sub utf8len(_) {
          utf8::upgrade($_[0]);
          require bytes;
          return bytes::length($_[0]);
      }

      Or the same without bytes:

      sub utf8len(_) {
          utf8::upgrade($_[0]);
          Encode::_utf8_off($_[0]);
          my $utf8len = length($_[0]);
          Encode::_utf8_on($_[0]);
          return $utf8len;
      }

      Update: Added non-bytes alternative.

        And just which part of
        That assumes that the strings are Unicode strings with their UTF‑8 flags on.
        didn’t you understand?

      I've studied the demonstration script and I understand everything it's doing, except for this bit:

      my $MAX_BYTES = 25;
      my ($MIN_BPC, $MAX_BPC) = (1, 4);
      my $MAX_CHARS = $MAX_BYTES / $MIN_BPC;

      What's going on here? $MAX_CHARS will always be set to the value of $MAX_BYTES, and $MAX_BPC seems to serve no function. Am I right?

      Also, what happens if, in the initial truncation of the string done using substr() as an lvalue, we land smack dab in the middle of a grapheme, and the rightmost character in the resultant truncated string is, by itself, a valid grapheme?

      D:\>perl -CO -Mcharnames=:full -wE "$MAX = 4; $cafe = qq/cafe\N{COMBINING ACUTE ACCENT}/; say $cafe; substr($cafe, $MAX) = ''; say $cafe;" > cafe.txt

      D:\>

      Here's the text in the output file cafe.txt:

      café
      cafe
      

      (Thanks again for this very helpful script!)

        Jim, you’re right on both points. The MAX constants are left over from when I was setting up to do something different; I never went back and cleaned up after myself.

        The substr truncation in the middle of a grapheme cluster is really ugly. That is what I had been trying so hard to avoid with the whole s/\X$// while too long thing. And you can’t guess how many add-ons there are. There are some standards that allow you to buffer only enough for ten, but those are not really relevant to general work.

        I’m afraid you may have to do something like this instead:

        # either this way:
        $s =~ s/^\X{0,$MAX_CHARS}\K.*//s;

        # or "by hand", this way:
        substr($s, pos $s) = "" if $s =~ /^\X{0,$MAX_CHARS}/g;
        And then do the backwards peeling-off of graphemes until the byte length is small enough. I wouldn’t count on the second form being faster; measure it if it matters.
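
        Putting the two steps together, a sketch of the whole recipe might look like this, with encode_utf8 standing in for the bytes::length peeking and shorten_safely being just an illustrative name. One caveat: perl caps {m,n} quantifiers at 32766 (at least as of 5.12), so a limit of exactly 32,767 would need the pre-truncation step handled some other way.

        use Encode qw(encode_utf8);

        sub shorten_safely {
            my ($s, $max_bytes) = @_;
            # Keep at most $max_bytes leading graphemes; anything beyond that
            # could never fit, even at one byte per grapheme.
            $s =~ s/^\X{0,$max_bytes}\K.*//s;
            # Peel whole graphemes off the end until the encoded size fits.
            $s =~ s/\X\z// while length(encode_utf8($s)) > $max_bytes;
            return $s;
        }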

        That’s just off the top of my head right now, which, seeing as it’s way past my bedtime, might be pretty far off. Hope this is of any help at all.

        I keep resisting the urge to break down and do it in C instead. Identifying an extended grapheme cluster by hand is not my idea of a good time. Look at the code that does it in regexec.c from 5.12 or later, the version with all the LVT business. It’s the part that starts at line 3768 right now in the current source tree, right at case CLUMP, and runs through line 3979, where the next case begins. I think you’ll see why I didn’t want to recreate all that business.

        3768         case CLUMP: /* Match \X: logical Unicode character.  This is defined as
        3769                        a Unicode extended Grapheme Cluster */
        3770             /* From http://www.unicode.org/reports/tr29 (5.2 version).  An
        3771                extended Grapheme Cluster is:
        3772
        3773                CR LF
        3774                | Prepend* Begin Extend*
        3775                | .
        3776
        3777                Begin is (Hangul-syllable | ! Control)
        3778                Extend is (Grapheme_Extend | Spacing_Mark)
        3779                Control is [ GCB_Control CR LF ]
        3780
        3781                The discussion below shows how the code for CLUMP is derived
        3782                from this regex.  Note that most of these concepts are from
        3783                property values of the Grapheme Cluster Boundary (GCB) property.

        And then it goes on for a couple hundred more lines of delight.