Re: generate character string based on byte count !!

I don't know what you mean. "Characters" don't have a length. The actual number of bytes taken by a character in a string depends on the coded character set (Unicode, Latin-1, ASCII...) and the encoding (for Unicode, these include UTF-8, UTF-16, UCS-2 and UCS-4).

Under UTF-8, the first 128 characters (code points 0 to 127) take 1 byte each, and higher-numbered characters take a variable number of bytes (I'm not sure about the exact encoding, but IIRC it can take up to 4 bytes under the current Unicode set). Under ASCII and Latin-1 all characters are encoded using 1 byte (8 bits). Under UCS-2 all characters take 2 bytes, and under UCS-4 all characters take 4 bytes.
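To make the sizes above concrete, here is a minimal sketch using the core Encode module. The helper name byte_len and the three sample characters are my own choices for illustration, not anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# byte_len() is a hypothetical helper: it encodes a character string
# in the named encoding and returns the number of bytes produced.
sub byte_len {
    my ($encoding, $string) = @_;
    return length encode($encoding, $string);
}

# Sample characters, chosen only for illustration:
# an ASCII letter, a Latin-1 letter, and a Japanese hiragana.
my @samples = (0x41, 0xE9, 0x3042);

for my $cp (@samples) {
    for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
        printf "U+%04X in %-8s: %d byte(s)\n",
            $cp, $encoding, byte_len($encoding, chr $cp);
    }
}
```

The BE variants are used so that no byte order mark is added to the count.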

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Re^2: generate character string based on byte count !! by hv (Prior) on Dec 08, 2004 at 11:01 UTC
More info: UTF8 ASCII as implemented in perl requires a second byte for codepoints 0x80 and higher, a third byte at 0x800, a fourth at 0x10000, a fifth at 0x200000, a sixth at 0x4000000 and a seventh at 0x80000000. Note that this extends beyond the defined Unicode range, since we may store things other than Unicode characters in our strings - perl supports any integer that fits in a UV (32-bit or 64-bit unsigned integer, depending on your perl build) as a codepoint. If I understand the code correctly (Perl_uvuni_to_utf8_flags() in utf8.c), higher codepoints (available only where perl is compiled with 64-bit integer support) use 7 bytes up to 0x1000000000, and a fixed 13 bytes for the rest.

Hugo
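The thresholds listed above can be turned into a small lookup. This is only a sketch: utf8_len is a made-up name, it does not handle the 13-byte case for codepoints at or above 0x1000000000, and the cross-check sticks to codepoints inside the Unicode range.

```perl
use strict;
use warnings;

# utf8_len() is a hypothetical helper. It returns how many bytes perl's
# UTF-8 needs for a codepoint below 0x1000000000, using the thresholds
# quoted in this post. Larger codepoints (13 bytes) are not handled.
sub utf8_len {
    my ($cp) = @_;
    my @limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000);
    my $len = 1;
    for my $limit (@limits) {
        last if $cp < $limit;
        $len++;
    }
    return $len;
}

# Cross-check the prediction against what perl actually produces.
for my $cp (0x7F, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFD) {
    my $bytes = chr $cp;
    utf8::encode($bytes);    # in place: characters become UTF-8 bytes
    printf "U+%06X: predicted %d, actual %d\n",
        $cp, utf8_len($cp), length $bytes;
}
```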
Re^2: generate character string based on byte count !! by barathbr (Scribe) on Dec 09, 2004 at 10:27 UTC
I guess I was not very clear in what I had written. In simple terms, what you are saying is correct, and as a matter of fact that's exactly what I want. Let's say I want to generate some random Japanese characters which are 2 bytes each. Please note that I still don't know whether you can encode a Japanese character in UTF-8 or UTF-16 or whatever the character set may be. Bottom line is, I don't really care what language the characters get generated in. I shouldn't have used the term 'length'. What I meant was that I want to generate a character string composed of characters of 2 bytes each, 4 bytes each, etc. Hope that clarifies things a bit.

BrowserUK, tall_man, thanks for the responses, but they don't quite solve my problem. I hope this post adds a little more clarity to what I seek.

Thanks everyone
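One possible reading of this request, assuming the target encoding is UTF-8, is to pick random codepoints from the range whose encoding has the wanted width. The sketch below does that; random_chars and %range_for are invented names, and many of the codepoints it picks will be unassigned or unprintable. Note that most Japanese characters take 3 bytes in UTF-8; they are 2 bytes each in UTF-16.

```perl
use strict;
use warnings;

# Codepoint ranges whose UTF-8 encoding takes exactly N bytes,
# limited to the Unicode range (so N is 1 to 4).
my %range_for = (
    1 => [0x20,    0x7E],        # printable ASCII only
    2 => [0x80,    0x7FF],
    3 => [0x800,   0xFFFF],
    4 => [0x10000, 0x10FFFF],
);

# random_chars() is a hypothetical helper: it returns $count random
# characters that each occupy $bytes_per_char bytes in UTF-8.
sub random_chars {
    my ($bytes_per_char, $count) = @_;
    my $range = $range_for{$bytes_per_char}
        or die "unsupported width: $bytes_per_char\n";
    my ($lo, $hi) = @$range;
    my $string = '';
    while (length($string) < $count) {
        my $cp = $lo + int rand($hi - $lo + 1);
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates
        next if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # noncharacters
        next if ($cp & 0xFFFE) == 0xFFFE;          # U+xxFFFE and U+xxFFFF
        $string .= chr $cp;
    }
    return $string;
}

my $string = random_chars(3, 10);
my $bytes  = $string;
utf8::encode($bytes);    # in place: characters become UTF-8 bytes
printf "%d characters, %d bytes\n", length $string, length $bytes;
```

For a different target encoding (UTF-16, or a legacy Japanese encoding) the ranges would have to be chosen differently.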