in reply to Best Way to Get Length of UTF-8 String in Bytes?
Is this what I want?
Probably. The problem is that it depends on one important thing:
Whatever in the world do you want it for, anyway?
I cannot ever remember needing it myself: dealing with low-level bytes instead of logical characters is nearly always the wrong way to go about matters.
It’s quite possible that there might be a better approach you just don’t know about.
Re^2: Best Way to Get Length of UTF-8 String in Bytes?
by Jim (Curate) on Apr 24, 2011 at 01:08 UTC
"The problem is that it depends on one important thing: Whatever in the world do you want it for, anyway?"

To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of.

"It's quite possible that there might be a better approach you just don't know about."

Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes' worth of it into a column?

Thank you for your reply. I appreciate it.

Jim
by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:42 UTC
So, start at position N of the UTF-8 encoded byte string, where N is the maximum length. While the byte at position N is a continuation byte, decrement N. Now you can truncate to length N. To prevent clipping the accents off a base character or something like that, you can furthermore look at the whole character beginning at N: check its Unicode properties to see whether it is a modifier or something similar. If it is, decrement N again and repeat.
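A minimal sketch of the byte-level backup step just described (my illustration, not code from the post; the helper name and sample string are made up), operating on already-encoded octets and omitting the extra modifier/grapheme check:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: truncate a UTF-8 octet string to at most $max bytes
# without cutting a multi-byte character in half.
sub truncate_octets {
    my ($octets, $max) = @_;
    return $octets if length($octets) <= $max;

    my $n = $max;
    # Back up while the byte at position $n is a continuation byte (10xxxxxx),
    # so the cut never lands inside a character.
    $n-- while $n > 0 && (ord(substr($octets, $n, 1)) & 0xC0) == 0x80;
    return substr($octets, 0, $n);
}

my $octets = encode_utf8("r\x{E9}sum\x{E9}");     # 8 octets for 6 characters
print length(truncate_octets($octets, 7)), "\n";  # cutting at 7 would split the last e-acute, so this prints 6
```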
by tchrist (Pilgrim) on Apr 24, 2011 at 04:06 UTC
"To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of."

I feel lucky that all my database experiences in recent memory have involved ones that had no fixed limits on any of their sizes. One still had to encode/decode to UTF-8, but I didn't have your particular problem.

"It's quite possible that there might be a better approach you just don't know about."

Well, Jim, that's quite a pickle. I think I'm going to renege on my advice in perlfunc. If you encode to UTF-8 bytes, then you won't know whether, and most especially where, to truncate your string, because you've lost the character information. And you really have to have character information, plus more besides.

It is vaguely possible that you might be able to arrange something with the \C regex escape for an octet, somehow combining a bytewise assertion of \A\C{0,32767} with one that fits a charwise \A.* or, better yet, a grapheme-wise ^\X* within that. But that isn't the approach I finally ended up using, because it sounded too complicated and messy. I decided to do something really simple.

My "Pac-Man®" algorithm is simple: chop until short enough. More specifically, remove the last grapheme until the string has fewer than your maximum bytes. You can do a bit better than blind naïveté by realizing that even at the maximum efficiency of one byte per character (pure ASCII), if the actual character length is more than the maximum allowed byte length, you can pre-truncate the character count. That way you don't come slowly pac-manning back from 100k strings.

There are a few things complicating your life. Just as you do not wish to chop off a byte in the middle of a character, neither do you want to chop off a character in the middle of a grapheme. You don't want "\cM\cJ" to get split if that's in your data, and you very most especially do not wish to lose a Grapheme_Extend code point, like a diacritic or an underline/overline, off of its Grapheme_Base code point.

What I ended up doing, therefore, assumes that the strings are Unicode strings with their UTF-8 flags on. I have done the wickedness of using the bytes namespace. This breaks the encapsulation of abstract characters: I am relying on knowing that the internal byte length is in UTF-8. If that changes (and there is no guarantee at all that it will not do so someday), then this code will break. Also, it is critical that the bytes module be required, not used. You do not want byte semantics for your operations; you just want to be able to get a bytes::length on its own.

I haven't benchmarked this against doing it the "pure" way with a bunch of calls to encode_utf8. You might want to do that. But what I have done is run the algorithm against a bunch of strings: some in NFD form, some in NFC; some French and some Chinese; some with multiple combining characters; some even with very fat graphemes from up in the SMP with their marks (math letters with overlines). I ran it with MAX == 25 bytes, but I see no reason why it shouldn't work set to your own 32,767. Here are some examples of the traces:
Here's the complete program. I've uniquoted the strings, so the program itself is actually in pure ASCII, which means I can put it in <code> tags here instead of messing around with icky <pre> and weird escapes. You can download it easily enough if you want, and play with the numbers and all, but the heart of the algorithm is just the one-liner that throws out the last grapheme and checks the byte length. You'll see that I've left the debugging in.
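A minimal sketch of that core loop as described (my reconstruction under the stated assumptions, not Tom's actual code; the helper name is made up):

```perl
use strict;
use warnings;
require bytes;   # required, not used: we want bytes::length alone, never byte semantics

# Hypothetical helper following the "Pac-Man" approach described above.
# Assumes $str is a decoded character string with its UTF-8 flag on.
sub trim_to_bytes {
    my ($str, $max_bytes) = @_;

    # Pre-truncate by character count: even at one byte per character,
    # anything beyond $max_bytes characters can never fit.
    substr($str, $max_bytes) = q() if length($str) > $max_bytes;

    # Throw out the last grapheme (\X) until the UTF-8 byte length fits.
    $str =~ s/\X\z// while bytes::length($str) > $max_bytes;

    return $str;
}

# Sample call; the characters above U+00FF keep the string's UTF-8 flag on.
my $trimmed = trim_to_bytes("R\x{E9}sum\x{E9} \x{4E2D}\x{6587} " x 3_000, 32_767);
```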
There are other ways to go about this, but this seemed to work well enough. Hope it helps. Oh, BTW, if you really want print columns instead of graphemes, look to the Unicode::GCString module; it comes with Unicode::LineBreak. Both are highly recommended. I use them in my unifmt program to do intelligent line breaking of Asian text per UAX #14.
by Jim (Curate) on Apr 24, 2011 at 05:17 UTC
Amazing! Thank you, thank you, thank you, Tom! It's a lot for me to assimilate with my limited intellect and meager Perl skills, but I will certainly try. My real objective, as awful as it sounds, is to split arbitrarily long UTF-8 strings into chunks of 32,767-byte substrings and distribute them into a series of VARCHAR columns. It's horrible, I know, but if I don't do it, another Ivfhny Onfvp .ARG programmer will, and much more badly than I.

Jim
by ikegami (Patriarch) on Apr 24, 2011 at 05:43 UTC
I see use bytes; without any utf8::upgrade or utf8::downgrade, and that usually indicates code that suffers from "The Unicode Bug".
should be
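A sketch of the kind of fix being suggested (my guess at the shape of it, not ikegami's actual code): pin the internal representation to UTF-8 before taking bytes::length.

```perl
use bytes ();            # load bytes::length without enabling byte semantics

my $str = "caf\x{E9}";   # hypothetical example string
utf8::upgrade($str);     # guarantee the internal representation is UTF-8
my $octets = bytes::length($str);   # 5: the e-acute occupies two bytes internally
```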
Or the same without bytes:
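And a sketch of the bytes-free alternative (again my reconstruction, not the original code): measure the encoded octets instead of peeking at the internal representation.

```perl
use Encode qw(encode_utf8);

my $str    = "caf\x{E9}";                 # hypothetical example string
my $octets = length(encode_utf8($str));   # also 5, independent of the internal form
```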
Update: Added non-bytes alternative.
by tchrist (Pilgrim) on Apr 24, 2011 at 06:01 UTC
by Anonymous Monk on Apr 24, 2011 at 06:04 UTC
by ikegami (Patriarch) on Apr 24, 2011 at 06:06 UTC
by Jim (Curate) on Apr 26, 2011 at 21:04 UTC
I've studied the demonstration script and I understand everything it's doing, except for this bit:
What's going on here? $MAX_CHARS will always be set to the value of $MAX_BYTES, and $MAX_BPC seems to serve no function. Am I right? Also, what happens if, in the initial truncation of the string done using substr() as an lvalue, we land smack dab in the middle of a grapheme, and the rightmost character in the resultant truncated string is, by itself, a valid grapheme?
Here's the text in the output file cafe.txt:
by tchrist (Pilgrim) on Apr 27, 2011 at 05:16 UTC
by ikegami (Patriarch) on Apr 27, 2011 at 07:18 UTC
Re^2: Best Way to Get Length of UTF-8 String in Bytes?
by BrowserUk (Patriarch) on Apr 24, 2011 at 12:53 UTC
"Whatever in the world do you want it for, anyway?"

Obtaining knowledge of the storage requirements for a piece of data does not seem such an unusual requirement to me. Whether it is for sizing a buffer when interfacing to a C (or other) language library; or for length-prefixing a packet for a transmission protocol; or for indexing a file; or any of a dozen other legitimate uses. Indeed, this information is readily and trivially available to Perl:
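For instance (my illustration, not necessarily the code that accompanied this post), Devel::Peek will dump a scalar's internals, and the CUR field of that dump is exactly the byte length of the stored string:

```perl
use Devel::Peek qw(Dump);

my $s = "caf\x{E9} \x{4E2D}\x{6587}";   # hypothetical mixed Latin/Chinese string
Dump($s);   # CUR in the output is the byte length of the internal (here UTF-8) buffer
```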
Given that, the absence of a simple built-in mechanism for obtaining it seems both remiss and arbitrary.

But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated crud of transitional standards and yesteryear's physical and financial restrictions. With RAM as cheap as it is today, variable-length encodings make no sense given the restrictions and overheads they impose. And any 'standard' under which it is impossible to tell what a piece of data actually represents without reference to some external metadata is equally nonsensical. With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends, like 6-bit bytes and 36-bit words.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by tchrist (Pilgrim) on Apr 24, 2011 at 16:13 UTC
"But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated crud of transitional standards and yesteryear's physical and financial restrictions."

Yes and no. The no part is that you seem to have confused UTF-8 with Unicode. Unicode is here to stay, and it does not share UTF-8's flaws. But realistically, you are simply never going to get rid of UTF-8 as a transfer format. Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transferred as UTF-32? That will never happen.

The yes part is that I agree that int is the new char. No fixed-width character data type should ever be smaller than the number of bits needed to store any and all possible Unicode code points. Because Unicode is a 21-bit charset, that means you need 32-bit characters. It also means that everyone who jumped on the broken UCS-2 or UTF-16 bandwagon is paying a really wicked price, since UTF-16 has all the disadvantages of UTF-8 but none of its advantages. At least Perl didn't make that particular brain-damaged mistake! It could have been much worse. UTF-8 is now the de facto standard, and I am very glad that Perl didn't do the stupid thing that Java and so many others did: just try matching non-BMP code points in character classes, for example. Can't do it in the UTF-16 languages. Oops! :(
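To make that last point concrete, here is a quick check (my example, not from the post) that Perl handles a code point above the BMP in a character class like any other, which is precisely where UTF-16-based engines of the day stumbled:

```perl
use strict;
use warnings;

# U+1D49C MATHEMATICAL SCRIPT CAPITAL A lives in the SMP, well above U+FFFF.
my $char = "\x{1D49C}";
print "matched\n" if $char =~ /\A[\x{1D400}-\x{1D7FF}]\z/;   # prints "matched"
```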
by BrowserUk (Patriarch) on Apr 24, 2011 at 19:23 UTC
"Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transferred as UTF-32?"

I think that this 'space' argument is a complete crock.

Firstly, if saving disk space is the primary criterion (even though disk space is cheaper today than ever before), then gzipping or even bzip2ing is far, far more effective than variable-length characters. Even if you expand the UTF-8 to UTF-32 prior to gzipping, the resultant file size is hardly affected. (A sketch for testing this follows below.)

Secondly, if saving RAM is the criterion, then load the data zipped and expand it in memory. Disk is slow and RAM is fast. The cost of unpacking on demand is offset, almost if not entirely, by the time saved reading smaller volumes from disk.

Finally, if the 'new' encoding scheme addressed the problem of having the data itself identify what it is, through, say, an expanded BOM mechanism or similar, then there would be no need to convert legacy data.

"It also means that everyone who jumped on the broken UCS-2 or UTF-16 bandwagon is paying a really wicked price"

Maybe so. But when the UCS-2 scheme was adopted, the full Unicode standard was nothing more than a twinkle in the eyes of its beholders, and those that adopted it had ten or so years of a workable solution before Unicode got its act together (in so far as it has). I remember the Byte magazine article entitled something like "Universal Character Set versus Unicode", circa 1993. From memory, it came down pretty clearly on the side of UCS at the time. UCS may not be the best solution now, but for the best part of 15 years it offered a solution that Unicode has only got around to matching in the last two or three years.

I can't address the specifics, though. I rarely ever encounter anything beyond ASCII data, so I've not bothered with it.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
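Regarding the claim above that gzipped UTF-32 is hardly bigger than gzipped UTF-8, here is a short sketch for testing it (my example, using the core Encode and IO::Compress::Gzip modules; the sample text is made up):

```perl
use strict;
use warnings;
use Encode qw(encode);
use IO::Compress::Gzip qw(gzip $GzipError);

my $text = "Caf\x{E9} au lait, \x{4E2D}\x{6587}, na\x{EF}ve r\x{E9}sum\x{E9}. " x 2_000;

for my $enc ('UTF-8', 'UTF-32BE') {
    my $octets = encode($enc, $text);
    my $packed;
    gzip \$octets => \$packed or die "gzip failed: $GzipError";
    printf "%-9s raw: %8d bytes   gzipped: %8d bytes\n",
           $enc, length($octets), length($packed);
}
```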
by ikegami (Patriarch) on Apr 25, 2011 at 05:05 UTC
UCS-2 isn't variable width, so I think it was an error to mention it.
What advantage does UTF-8 have over UTF-16? I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.
by Anonymous Monk on Apr 25, 2011 at 06:40 UTC
by ikegami (Patriarch) on Apr 25, 2011 at 06:59 UTC
by tchrist (Pilgrim) on Apr 27, 2011 at 05:30 UTC
by ikegami (Patriarch) on Apr 27, 2011 at 07:06 UTC