Unicode word wrapping

lestrrat has asked for the wisdom of the Perl Monks concerning the following question:

Re: Unicode word wrapping by seattlejohn (Deacon) on Dec 09, 2002 at 08:43 UTC
What version of Perl are you using? 5.6.x exhibits some weird behavior with respect to Unicode. In particular, as `perldoc perlunicode` explains under "Input and Output Disciplines": "There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future." So part of the problem may be this: you expect your query parameter to be encoded in UTF-8 (I'm assuming), but your script just sees a sequence of extended-ASCII characters. You might be able to get around this by explicitly using `pack "U",...` to reconstruct UTF-8 characters from the input one at a time, but I don't recall if I ever got that technique to work reliably. If you're just trying to ensure that an input string doesn't exceed a particular character length, you should be able to use `length($string)` to get its length in characters rather than bytes. That assumes that you already have it stored internally as UTF-8, of course, and that you haven't done a `use bytes`. Unicode support in 5.8 is supposed to be much improved, but I haven't had a chance to try it for myself yet. $perlmonks{seattlejohn} = 'John Clyman';
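To illustrate the bytes-versus-characters point, here is a minimal sketch that assumes Perl 5.8 or later, where the core `Encode` module is available. The sample bytes are just an illustration (the UTF-8 encoding of "日本語"), not anything taken from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive from a file or a CGI parameter:
# the UTF-8 encoding of three Japanese characters (assumed sample data).
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length($bytes);   # counts bytes, because nothing was decoded
my $char_len = length($chars);   # counts characters

print "bytes: $byte_len, characters: $char_len\n";   # bytes: 9, characters: 3
```

Until the input has been decoded, `length` has no way to know that the nine bytes represent three characters.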
Re: Re: Unicode word wrapping by lestrrat (Deacon) on Dec 09, 2002 at 08:56 UTC
Sorry, I guess I wasn't clear. First, I'm using Perl 5.8. As for the input from the CGI, it's originally in EUC-JP, and I convert it to UTF-8 when I receive it from the browser. This is because we eventually shove it into XML. I want to wrap THAT UTF-8 string at a certain column.
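For reference, the EUC-JP to UTF-8 step described here might look something like the following in Perl 5.8. This is only a sketch of one way to do it with the core `Encode` module, not necessarily what the poster's script does, and the sample bytes (the EUC-JP encoding of "日本語") are an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: "日本語" as EUC-JP bytes, as a browser might send it.
my $euc_bytes = "\xC6\xFC\xCB\xDC\xB8\xEC";

# Decode into a character string; wrapping should be done on this form,
# so that length() and substr() operate on characters, not bytes.
my $text = decode('euc-jp', $euc_bytes);

# Encode to UTF-8 bytes only when writing the result out (e.g. into XML).
my $utf8_bytes = encode('UTF-8', $text);

print length($text), " characters, ", length($utf8_bytes), " UTF-8 bytes\n";
```

Keeping the decoded string around for the wrapping step, and encoding only on output, avoids splitting a multi-byte character in the middle.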
Re: Unicode word wrapping by rasta (Hermit) on Dec 09, 2002 at 09:24 UTC
I'm not sure I understood the problem correctly, but I would advise generating all your pages with the following line in the HEAD: `<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">`, or adding `Content-Type: text/html; charset=utf-8` to the HTTP header, and producing all your pages in UTF-8. This automatically switches the user agent to UTF-8, so you will have no problem with multiple codepages. Keep in mind, though, that specifying different charsets in HTTP and HTML can cause unexpected browser behavior. -- Yuriy Syrota
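In a plain CGI script, the HTTP header variant could be sketched as below. The helper name `utf8_html_header` is made up for this example, and the sketch assumes Perl 5.8 or later with the page body held as a decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: the header block announcing a UTF-8 HTML response.
sub utf8_html_header {
    return "Content-Type: text/html; charset=utf-8\r\n\r\n";
}

# Have Perl encode everything printed to STDOUT as UTF-8, so the bytes
# that are sent match the charset declared in the header.
binmode(STDOUT, ':encoding(UTF-8)');

my $header = utf8_html_header();
print $header;
print "<html><head><title>test</title></head><body>\x{65E5}\x{672C}\x{8A9E}</body></html>\n";
```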
Re: Unicode word wrapping by theorbtwo (Prior) on Dec 09, 2002 at 14:45 UTC
There's a well-defined method for doing this in Unicode: UAX#14: Line Breaking Properties. Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).
Re: Unicode word wrapping by ph0enix (Friar) on Dec 09, 2002 at 09:40 UTC
I don't know what could be wrong... What about the following code? `... use utf8; my $max_text_len = 40; ... sub split_text { my $intext = shift || return ''; my @result = (''); foreach my $word (split(' ', $intext)) { if ($result[-1] eq '') { $result[-1] = $word; } elsif (length("$result[-1] $word") > $max_text_len) { push @result, $word; } else { $result[-1] .= " $word"; } } return join("\n", @result); }`
Re: Unicode word wrapping by dingus (Friar) on Dec 09, 2002 at 12:15 UTC
Remember that because Japanese doesn't really have spaces between words, you can split it ANYWHERE without a problem. So assuming you detect some Japanese, your rule should simply be: insert a \n after every N characters if there isn't one already. Dingus `Enter any 47-digit prime number to continue.`
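A minimal sketch of that "break after every N characters" rule follows, assuming the text has already been decoded into a Perl character string. The helper name `wrap_every` and the sample data are made up for illustration. Note that it counts characters rather than display columns, and it ignores the rules about which characters may start or end a line, which is the part UAX#14 addresses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: break a decoded string after every $width
# characters, keeping any newlines that are already present.
sub wrap_every {
    my ($text, $width) = @_;
    my @out;
    for my $line (split /\n/, $text, -1) {
        # Cut each existing line into chunks of at most $width characters.
        my @chunks = $line =~ /(.{1,$width})/g;
        push @out, @chunks ? @chunks : '';
    }
    return join("\n", @out);
}

# Assumed sample: "日本語" repeated three times (9 characters) as UTF-8 bytes.
my $text    = decode('UTF-8', "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" x 3);
my $wrapped = wrap_every($text, 4);

print encode('UTF-8', $wrapped), "\n";
```

With a width of 4, the nine characters come out as lines of 4, 4, and 1 characters.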

