To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of.

I feel lucky that all my database experiences in recent memory have involved ones that had no fixed limits on any of their sizes. One still had to encode/decode to UTF-8, but I didn’t have your particular problem.

It’s quite possible that there might be a better approach you just don’t know about.

Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes worth of it into a column?
Well, Jim, that’s quite a pickle. I think I’m going to renege on my advice in perlfunc. If you encode to UTF‑8 bytes, then you won’t know whether, and most especially where, to truncate your string, because you’ve lost the character information. And you really have to have character information, plus more, too.
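To make the problem concrete, here is a minimal sketch of what goes wrong when you cut the encoded bytes blindly. The sample string and the 4-byte limit are made up for illustration; the only module used is the core Encode.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# "caf\x{E9}" is 4 characters but 5 bytes in UTF-8 (U+00E9 encodes as C3 A9).
my $str   = "caf\x{E9}";
my $bytes = encode_utf8($str);

my $char_len = length($str);      # counts characters
my $byte_len = length($bytes);    # counts octets of the encoded form

# Cutting the encoded form at 4 octets keeps C3 but drops A9,
# so the result ends in an incomplete UTF-8 sequence.
my $cut = substr($bytes, 0, 4);

# Lenient decoding cannot recover the original last character.
my $decoded = decode('UTF-8', $cut);

printf "chars=%d bytes=%d last code point after round trip: U+%04X\n",
    $char_len, $byte_len, ord(substr($decoded, -1));
```

Once you are holding only the octets, nothing tells you that byte 4 was the first half of a character, which is why the truncation has to be decided on the character string.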
It is vaguely possible that you might be able to arrange something with the \C regex escape for an octet, somehow combining a bytewise assertion of \A\C{0,32767} with one that fits a charwise \A.* or better yet a grapheme-wise ^\X* within that.
But that isn’t the approach I finally ended up using, because that sounded too complicated and messy. I decided to do something really simple. My “Pac‑Man®” algorithm is simple: chop until short enough. More specifically, remove the last grapheme until the string has no more than your maximum bytes.
You can do a bit better than blind naïveté by realizing that even at maximum efficiency of one byte per character (pure ASCII), if the actual char length is more than the maximum allowed byte length, you can pre-truncate the character count. That way you don’t come slowly pac-manning back from 100k strings.
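The arithmetic behind that pre-truncation can be sketched like this. The 25-byte limit and the test string are arbitrary, and I am using Encode here only to count bytes for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $max_bytes = 25;

# 100,000 copies of U+00E9, which takes 2 bytes per character in UTF-8.
my $s = "\x{E9}" x 100_000;

# Every character needs at least one byte, so anything past the first
# $max_bytes characters can never fit; drop it in one step.
substr($s, $max_bytes) = "" if length($s) > $max_bytes;

my $chars_left = length($s);
my $bytes_left = length(encode_utf8($s));

# Unicode code points take at most 4 bytes each in UTF-8, so what is
# left is bounded by 4 * $max_bytes no matter how long the input was.
printf "chars=%d bytes=%d upper bound=%d\n",
    $chars_left, $bytes_left, 4 * $max_bytes;
```

After the one-shot cut, the grapheme-by-grapheme loop only has a few dozen bytes to nibble through instead of a hundred thousand characters.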
There are a few things complicating your life. Just as you do not wish to chop off a byte in the middle of a character, neither do you want to chop off a character in the middle of a grapheme. You don’t want "\cM\cJ" to get split if that’s in your data, and you very most especially do not wish to lose a Grapheme_Extend code point like a diacritic or an underline/overline off of its Grapheme_Base code point.
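Here is a small sketch of the difference between removing a code point and removing a grapheme. The helper name grapheme_count is mine, purely for illustration, and it assumes a perl recent enough (5.12 or later) that \X matches an extended grapheme cluster.

```perl
use strict;
use warnings;

# Count extended grapheme clusters by matching \X repeatedly.
sub grapheme_count {
    my ($s) = @_;
    my $n = 0;
    $n++ while $s =~ /\X/g;
    return $n;
}

my $crlf   = "\cM\cJ";       # CR LF: two code points, one grapheme
my $e_circ = "e\x{0302}";    # e + COMBINING CIRCUMFLEX ACCENT

# chop() removes one code point: it strips the accent and leaves a bare "e".
my $by_char = $e_circ;
chop $by_char;

# s/\X\z// removes the whole grapheme, base and mark together.
(my $by_graph = $e_circ) =~ s/\X\z//;

printf "CRLF: %d code points, %d grapheme(s)\n",
    length($crlf), grapheme_count($crlf);
printf "after chop: <%s>, after grapheme removal: <%s>\n",
    $by_char, $by_graph;
```

The chop result is the dangerous one: it is still valid text, it just silently says something different from what the user typed.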
What I ended up doing, therefore, was this:

    require bytes;
    if (length($s) > $MAX) {
        substr($s, $MAX) = "";
    }
    $s =~ s/\X\z// until bytes::length($s) <= $MAX;

That assumes that the strings are Unicode strings with their UTF‑8 flags on. I have done the wickedness of using the bytes namespace. This breaks the encapsulation of abstract characters. I am relying on knowing that the internal byte length is in UTF-8. If that changes (and there is no guarantee at all that it will not do so someday) then this code will break.
Also, it is critical that it be required, not used. You do not want byte semantics for your operations; you just want to be able to get a bytes::length on its own.
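A quick sketch of why the distinction matters. The one-character test string is arbitrary; the inner block exists only to show what you would get if you did say use bytes.

```perl
use strict;
use warnings;
require bytes;    # makes bytes::length available without byte semantics

my $s = "\x{84DD}";    # one CJK character, 3 bytes in UTF-8

my $chars  = length($s);           # character semantics still in force
my $octets = bytes::length($s);    # explicit request for the byte count

my $len_under_use;
{
    use bytes;                     # lexically scoped byte semantics
    $len_under_use = length($s);   # length() now counts bytes instead
}

printf "length=%d bytes::length=%d length under 'use bytes'=%d\n",
    $chars, $octets, $len_under_use;
```

With use bytes in scope, length, substr, and the regex engine would all start treating the string as octets, so the grapheme chopping would be working on the wrong units.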
I haven’t benchmarked this against doing it the “pure” way with a bunch of calls to encode_utf8. You might want to do that.
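For comparison, the “pure” variant might look something like the sketch below. The names utf8_len and truncate_to_bytes are hypothetical helpers of mine, not anything from a module, and I have not timed this against the bytes::length version.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Byte length of the UTF-8 encoding, without peeking at perl's internals.
sub utf8_len {
    my ($s) = @_;
    return length(encode_utf8($s));
}

# Hypothetical helper: trim $s to at most $max bytes of UTF-8,
# dropping whole graphemes from the end.
sub truncate_to_bytes {
    my ($s, $max) = @_;
    # One character is at least one byte, so cut the obvious excess first.
    substr($s, $max) = "" if length($s) > $max;
    # Then remove trailing graphemes until the encoded form fits.
    $s =~ s/\X\z// while utf8_len($s) > $max;
    return $s;
}

# 21 characters, 28 bytes: two CJK characters plus decomposed accents.
my $in  = "NFD \x{84DD} la\x{301}n and \x{7EFF} lu\x{308}\x{300}";
my $out = truncate_to_bytes($in, 25);

printf "in: %d chars, %d bytes; out: %d chars, %d bytes\n",
    length($in), utf8_len($in), length($out), utf8_len($out);
```

This re-encodes the whole string on every pass, which is the cost you would want to measure, but it does not depend on how perl happens to store the string internally.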
But what I have done is run the algorithm against a bunch of strings: some in NFD form, some in NFC; some French and Chinese; some with multiple combining characters; some even with very fat graphemes from up in the SMP with their marks (math letters with overlines). I ran it with MAX == 25 bytes, but I see no reason why it shouldn’t work set to your own 32,767. Here are some examples of the traces:
String was <NFD: tête‐à‐tête tête‐à‐tête>
start string has graphlen 28, charlen 34, bytelen 48
CHARLEN 34 > 25, truncating to 25 CHARS
bytelen 33 still too long, chopping last grapheme
deleted grapheme <e> U+0065, charlen -1, bytelen -1
bytelen 32 still too long, chopping last grapheme
deleted grapheme <t> U+0074, charlen -1, bytelen -1
bytelen 31 still too long, chopping last grapheme
deleted grapheme <ê> U+0065.0302, charlen -2, bytelen -3
bytelen 28 still too long, chopping last grapheme
deleted grapheme <t> U+0074, charlen -1, bytelen -1
bytelen 27 still too long, chopping last grapheme
deleted grapheme < > U+0020, charlen -1, bytelen -1
bytelen 26 still too long, chopping last grapheme
deleted grapheme <e> U+0065, charlen -1, bytelen -1
final string has graphlen 15, charlen 18, bytelen 25
Trunc'd is <NFD: tête‐à‐têt>
String was <NFD 蓝 lán and 绿 lǜ>
start string has graphlen 18, charlen 21, bytelen 28
bytelen 28 still too long, chopping last grapheme
deleted grapheme <ǜ> U+0075.0308.0300, charlen -3, bytelen -5
final string has graphlen 17, charlen 18, bytelen 23
Trunc'd is <NFD 蓝 lán and 绿 l>
String was <Chinese: 青天,白日,满地红>
start string has graphlen 18, charlen 18, bytelen 36
bytelen 36 still too long, chopping last grapheme
deleted grapheme <红> U+7EA2, charlen -1, bytelen -3
bytelen 33 still too long, chopping last grapheme
deleted grapheme <地> U+5730, charlen -1, bytelen -3
bytelen 30 still too long, chopping last grapheme
deleted grapheme <满> U+6EE1, charlen -1, bytelen -3
bytelen 27 still too long, chopping last grapheme
deleted grapheme <,> U+FF0C, charlen -1, bytelen -3
final string has graphlen 14, charlen 14, bytelen 24
Trunc'd is <Chinese: 青天,白日>
String was <NFD: hã̂ç̌k hã̂ç̌k hẫç̌k hẫç̌k>
start string has graphlen 24, charlen 35, bytelen 51
CHARLEN 35 > 25, truncating to 25 CHARS
bytelen 37 still too long, chopping last grapheme
deleted grapheme <ç̌> U+00E7.030C, charlen -2, bytelen -4
bytelen 33 still too long, chopping last grapheme
deleted grapheme <ẫ> U+1EAB, charlen -1, bytelen -3
bytelen 30 still too long, chopping last grapheme
deleted grapheme <h> U+0068, charlen -1, bytelen -1
bytelen 29 still too long, chopping last grapheme
deleted grapheme < > U+0020, charlen -1, bytelen -1
bytelen 28 still too long, chopping last grapheme
deleted grapheme <k> U+006B, charlen -1, bytelen -1
bytelen 27 still too long, chopping last grapheme
deleted grapheme <ç̌> U+0063.0327.030C, charlen -3, bytelen -5
final string has graphlen 12, charlen 16, bytelen 22
Trunc'd is <NFD: hã̂ç̌k hã̂>
String was <𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]>
start string has graphlen 17, charlen 20, bytelen 34
bytelen 34 still too long, chopping last grapheme
deleted grapheme <]> U+005D, charlen -1, bytelen -1
bytelen 33 still too long, chopping last grapheme
deleted grapheme <²> U+00B2, charlen -1, bytelen -2
bytelen 31 still too long, chopping last grapheme
deleted grapheme <𝐁̅> U+1D401.0305, charlen -2, bytelen -6
final string has graphlen 14, charlen 16, bytelen 25
Trunc'd is <𝐂̅ = sqrt[𝐀̅² + >
Here’s the complete program. I’ve uniquoted the strings, so the program itself is actually in pure ASCII, which means I can put it in <code> tags here instead of messing around with icky <pre> and weird escapes. You can download it easily enough if you want, play with the numbers and all, but the heart of the algorithm is just the one-liner to throw out the last grapheme and check the byte length. You’ll see that I’ve left the debugging in.
    #!/usr/bin/env perl

    use 5.12.0;
    use strict;
    use autodie;
    use warnings;
    use utf8;
    use open qw<:std :utf8>;
    use charnames qw< :full >;

    require bytes;

    my $MAX_BYTES = 25;
    my ($MIN_BPC, $MAX_BPC) = (1, 4);
    my $MAX_CHARS = $MAX_BYTES / $MIN_BPC;

    sub bytelen(_) {
        require bytes;
        return bytes::length($_[0]);
    }

    sub graphlen(_) {
        my $count = 0;
        $count++ while $_[0] =~ /\X/g;
        return $count;
    }

    sub charlen(_) {
        return length($_[0]);
    }

    sub shorten(_) {
        my $s = $_[0];
        printf "\tstart string has graphlen %d, charlen %d, bytelen %d\n",
            graphlen($s), charlen($s), bytelen($s);

        if (charlen($s) > $MAX_CHARS) {
            printf "\tCHARLEN %d > %d, truncating to %d CHARS\n",
                length($s), $MAX_BYTES, $MAX_CHARS;
            substr($s, $MAX_CHARS) = "";
        }

        while (bytelen($s) > $MAX_BYTES) {
            printf "\tbytelen %d still too long, chopping last grapheme\n",
                bytes::length($s);
            $s =~ s/(\X)\z//;
            printf "\tdeleted grapheme <%s> U+%v04X, charlen -%d, bytelen -%d\n",
                $1, $1, length($1), bytes::length($1);
        }

        printf "\tfinal string has graphlen %d, charlen %d, bytelen %d\n",
            graphlen($s), charlen($s), bytelen($s);
        return $s;
    }

    my @strings = (
        "this lines starts a bit too long",
        "NFC: cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e",
        "NFC: t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te\N{HYPHEN}\N{LATIN SMALL LETTER A WITH GRAVE}\N{HYPHEN}t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te\N{HYPHEN}\N{LATIN SMALL LETTER A WITH GRAVE}\N{HYPHEN}t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te",
        "NFD: cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e",
        "NFD: te\N{COMBINING CIRCUMFLEX ACCENT}te\N{HYPHEN}a\N{COMBINING GRAVE ACCENT}\N{HYPHEN}te\N{COMBINING CIRCUMFLEX ACCENT}te te\N{COMBINING CIRCUMFLEX ACCENT}te\N{HYPHEN}a\N{COMBINING GRAVE ACCENT}\N{HYPHEN}te\N{COMBINING CIRCUMFLEX ACCENT}te",
        "NFC \N{U+84DD} l\N{LATIN SMALL LETTER A WITH ACUTE}n and \N{U+7EFF} l\N{LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE}",
        "NFD \N{U+84DD} la\N{COMBINING ACUTE ACCENT}n and \N{U+7EFF} lu\N{COMBINING DIAERESIS}\N{COMBINING GRAVE ACCENT}",
        "XXX NFC q\N{LATIN SMALL LETTER I WITH MACRON}ng ti\N{LATIN SMALL LETTER A WITH MACRON}n, b\N{LATIN SMALL LETTER A WITH ACUTE}i r\N{LATIN SMALL LETTER I WITH GRAVE}, m\N{LATIN SMALL LETTER A WITH CARON}n d\N{LATIN SMALL LETTER I WITH GRAVE} h\N{LATIN SMALL LETTER O WITH ACUTE}ng",
        "XXX NFD qi\N{COMBINING MACRON}ng tia\N{COMBINING MACRON}n, ba\N{COMBINING ACUTE ACCENT}i ri\N{COMBINING GRAVE ACCENT}, ma\N{COMBINING CARON}n di\N{COMBINING GRAVE ACCENT} ho\N{COMBINING ACUTE ACCENT}ng",
        "Chinese: \N{U+9752}\N{U+5929}\N{FULLWIDTH COMMA}\N{U+767D}\N{U+65E5}\N{FULLWIDTH COMMA}\N{U+6EE1}\N{U+5730}\N{U+7EA2}",
        "normal \N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E} normal \N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}",
        "NFC: h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k",
        "NFD: ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k",
        "\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]",
        "4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}",
    );

    printf "MAX byte length is %d\n\n", $MAX_BYTES;

    for my $line (@strings) {
        chomp $line;
        say "String was <$line>";
        my $trunk = shorten($line);
        say "Trunc'd is <$trunk>\n";
    }

    exit 0;

There are other ways to go about this, but this seemed to work well enough. Hope it helps.
Oh, BTW, if you really want to do print-columns instead of graphemes, look to the Unicode::GCString module; it comes with Unicode::LineBreak. Both are highly recommended. I use them in my unifmt program to do intelligent linebreaking of Asian text per UAX#14.