in reply to Re: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?
The problem is that it depends on one important thing: Whatever in the world do you want it for, anyway?
To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of.
It’s quite possible that there might be a better approach you just don’t know about.
Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes worth of it into a column?
Thank you for your reply. I appreciate it.
Jim
Re^3: Best Way to Get Length of UTF-8 String in Bytes?
by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:42 UTC
So, start at position N of the UTF-8 encoded byte string, where N is the maximum length. While the byte at position N is a continuation byte, decrement N. Now you can truncate to length N. To prevent clipping the accents off a base character or something like that, you can furthermore look at the whole character beginning at N: check its Unicode properties to see if it's a modifier or something, and if it is, decrement N again and repeat.
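A minimal sketch of that walk, assuming the string has already been encoded to UTF-8 octets (the subroutine name is mine):

```perl
use strict;
use warnings;

# Back up from the byte limit to a character boundary: UTF-8
# continuation bytes all match the bit pattern 10xxxxxx (0x80..0xBF).
sub truncate_octets {
    my ($octets, $max) = @_;
    return $octets if length($octets) <= $max;
    my $n = $max;
    $n-- while $n > 0 && (ord(substr($octets, $n, 1)) & 0xC0) == 0x80;
    return substr($octets, 0, $n);
}
```

This only guarantees a character boundary; the Unicode-properties refinement would additionally decode the character at the cut point and keep decrementing while it is a modifier or combining mark.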
Re^3: Best Way to Get Length of UTF-8 String in Bytes?
by tchrist (Pilgrim) on Apr 24, 2011 at 04:06 UTC
To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of.

I feel lucky that all my database experiences in recent memory have involved ones that had no fixed limits on any of their sizes. One still had to encode/decode to UTF-8, but I didn’t have your particular problem.

It’s quite possible that there might be a better approach you just don’t know about.

Well, Jim, that’s quite a pickle. I think I’m going to renege on my advice in perlfunc. If you encode to UTF‑8 bytes, then you won’t know whether, and most especially where, to truncate your string, because you’ve lost the character information. And you really have to have character information, plus more, too.

It is vaguely possible that you might be able to arrange something with the \C regex escape for an octet, somehow combining a bytewise assertion of \A\C{0,32767} with one that fits a charwise \A.* or, better yet, a grapheme-wise ^\X* within that. But that isn’t the approach I finally ended up using, because it sounded too complicated and messy. I decided to do something really simple.

My “Pac‑Man®” algorithm is simple: chop until short enough. More specifically, remove the last grapheme until the string fits within your maximum bytes. You can do a bit better than blind naïveté by realizing that even at the maximum efficiency of one byte per character (pure ASCII), if the actual character length is more than the maximum allowed byte length, you can pre-truncate by character count. That way you don’t come slowly pac-manning back from 100k strings.

There are a few things complicating your life. Just as you do not wish to chop off a byte in the middle of a character, neither do you want to chop off a character in the middle of a grapheme. You don’t want "\cM\cJ" to get split if that’s in your data, and you very most especially do not wish to lose a Grapheme_Extend code point like a diacritic or an underline/overline off of its Grapheme_Base code point.

What I ended up doing, therefore, was this (a sketch follows at the end of this post). That assumes that the strings are Unicode strings with their UTF‑8 flags on. I have done the wickedness of using the bytes namespace. This breaks the encapsulation of abstract characters: I am relying on knowing that the internal byte length is in UTF-8. If that changes — and there is no guarantee at all that it will not do so someday — then this code will break. Also, it is critical that it be required, not used. You do not want byte semantics for your operations; you just want to be able to call bytes::length on its own.

I haven’t benchmarked this against doing it the “pure” way with a bunch of calls to encode_utf8; you might want to do that. But what I have done is run the algorithm against a bunch of strings: some in NFD form, some in NFC; some French and some Chinese; some with multiple combining characters; some even with very fat graphemes from up in the SMP with their marks (math letters with overlines). I ran it with MAX == 25 bytes, but I see no reason why it shouldn’t work set to your own 32,767. Here are some examples of the traces:
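A minimal sketch of the Pac-Man approach as described (not the original program; it assumes decoded Unicode strings, uses the plain encode_utf8 check rather than the bytes::length trick, and the names are mine):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

our $MAX_BYTES = 32_767;

sub pacman_fit {
    my ($str) = @_;    # a decoded Unicode string, not octets
    # Pre-truncate: at one byte per character minimum, anything past
    # $MAX_BYTES characters can never fit, so don't pac-man back from
    # a 100k string. (This character-based slice can itself land in
    # the middle of a grapheme; see the follow-ups further down.)
    substr($str, $MAX_BYTES) = "" if length($str) > $MAX_BYTES;
    # Eat the last extended grapheme cluster until the encoding fits.
    # \z rather than $ so a trailing newline is data like any other.
    $str =~ s/\X\z// while length(encode_utf8($str)) > $MAX_BYTES;
    return $str;
}
```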
Here’s the complete program. I’ve uniquoted the strings, so the program itself is actually in pure ASCII, which means I can put it in <code> tags here instead of messing around with icky <pre> and weird escapes. You can download it easily enough if you want and play with the numbers and all, but the heart of the algorithm is just the one-liner to throw out the last grapheme and check the byte length. You’ll see that I’ve left the debugging in.
There are other ways to go about this, but this seemed to work well enough. Hope it helps. Oh, BTW, if you really want to do print-columns instead of graphemes, look to the Unicode::GCString module; it comes with Unicode::LineBreak. Both are highly recommended. I use them in my unifmt program to do intelligent linebreaking of Asian text per UAX#14.
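For the print-columns case, the idea with Unicode::GCString looks roughly like this (assuming the module is installed; the sample string is mine):

```perl
use strict;
use warnings;
use Unicode::GCString;

my $gcs = Unicode::GCString->new("\x{5BB6}abc");  # one wide CJK char plus ASCII
print $gcs->columns, "\n";   # 5: the CJK character occupies two print columns
print $gcs->length,  "\n";   # 4 grapheme clusters
```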
by Jim (Curate) on Apr 24, 2011 at 05:17 UTC
Amazing! Thank you, thank you, thank you, Tom! It's a lot for me to assimilate with my limited intellect and meager Perl skills, but I will certainly try. My real objective—as awful as it sounds—is to split arbitrarily long UTF-8 strings into chunks of 32,767-byte substrings and distribute them into a series of VARCHAR columns. It's horrible, I know, but if I don't do it, another Ivfhny Onfvp .ARG programmer will—and much more badly than I.

Jim
by ikegami (Patriarch) on Apr 24, 2011 at 05:43 UTC
I see use bytes; without any utf8::upgrade or utf8::downgrade, and that usually indicates code that suffers from "The Unicode Bug".
should be
Or the same without bytes:
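The two snippets being contrasted aren't preserved above; the point was presumably along these lines (a reconstruction, not ikegami's exact code):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $s = "caf\x{E9}";   # "café" as a decoded character string

# bytes::length reports the length of the *internal* representation,
# which equals the UTF-8 byte count only when the string is internally
# upgraded; forcing the upgrade first sidesteps the Unicode Bug.
utf8::upgrade($s);
require bytes;
my $len = bytes::length($s);          # 5

# Or the same without bytes: encode explicitly and count the octets.
my $len2 = length(encode_utf8($s));   # 5
```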
Update: Added non-bytes alternative.
by tchrist (Pilgrim) on Apr 24, 2011 at 06:01 UTC
That assumes that the strings are Unicode strings with their UTF‑8 flags on.

What part of that didn't you understand?
by Anonymous Monk on Apr 24, 2011 at 06:04 UTC
by ikegami (Patriarch) on Apr 24, 2011 at 06:06 UTC
by Jim (Curate) on Apr 26, 2011 at 21:04 UTC
I've studied the demonstration script and I understand everything it's doing, except for this bit:
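Presumably the bit in question was something like this (a reconstruction from the question below, not the original script):

```perl
my $MAX_BYTES = 25;
my $MAX_BPC   = 1;                           # minimum bytes per character
my $MAX_CHARS = int($MAX_BYTES / $MAX_BPC);  # equals $MAX_BYTES when $MAX_BPC is 1
```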
What's going on here? $MAX_CHARS will always be set to the value of $MAX_BYTES, and $MAX_BPC seems to serve no function. Am I right? Also, what happens if, in the initial truncation of the string done using substr() as an lvalue, we land smack dab in the middle of a grapheme, and the rightmost character in the resultant truncated string is, by itself, a valid grapheme?
Here's the text in the output file cafe.txt:
by tchrist (Pilgrim) on Apr 27, 2011 at 05:16 UTC
The substr truncation in the middle of a grapheme cluster is really ugly. That is what I had been trying so hard to avoid with the whole s/\X$// while too long thing. And you can't guess at how many add-ons there are. There are some standards that allow you to buffer only enough for ten, but those are not really relevant to general work.

I'm afraid you may have to do something like the grapheme-wise sketch at the end of this reply instead, and then do the backwards peeling-off of graphemes until the byte length is small enough. I wouldn't count on the second form being faster; measure it if it matters. That's just off the top of my head right now, which, seeing as it's way past my bedtime, might be pretty off. Hope this is of any help at all.

I keep resisting the urge to break down and do it in C instead. Identifying an extended grapheme cluster by hand is not my idea of a good time. Look at the code that does it in regexec.c from 5.12 or later, the version with all the LVT business. It's the part that starts at line 3768 right now in the current source tree, right at case CLUMP, and runs through line 3979 for the next case. I think you'll see why I didn't want to recreate all that business.
And then it goes on for a couple hundred more lines of delight.
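Back in Perl, a rough sketch of the grapheme-wise truncation being suggested (my reconstruction, not the original code; it folds the truncation and the byte check into one forward pass over \X clusters):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Walk forward one extended grapheme cluster (\X) at a time, keeping a
# running UTF-8 byte count, and stop before the cluster that would
# overflow the budget. This never splits a character or a grapheme,
# at the cost of giving up the fast character-count pre-truncation.
sub fit_graphemes_to_bytes {
    my ($str, $max_bytes) = @_;
    my ($out, $bytes) = ('', 0);
    while ($str =~ /(\X)/g) {
        my $g = $1;
        my $b = length(encode_utf8($g));
        last if $bytes + $b > $max_bytes;
        $out   .= $g;
        $bytes += $b;
    }
    return $out;
}
```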
by ikegami (Patriarch) on Apr 27, 2011 at 07:18 UTC
by tchrist (Pilgrim) on Apr 30, 2011 at 15:40 UTC