in reply to generate character string based on byte count !!

I don't know what you mean. "Characters" don't have a length. The actual number of bytes taken by a character in a string depends on the coded character set (Unicode, Latin-1, ASCII...) and the encoding (for Unicode, these include UTF-8, UTF-16, UCS-2 and UCS-4).

Under UTF-8, the first 128 characters (code points 0 to 127) take 1 byte each, and higher-numbered characters take a variable number of bytes (I'm not sure about the exact encoding, but IIRC it can take up to 4 bytes under the current Unicode set). Under ASCII and Latin-1 all characters are encoded using 1 byte (8 bits, though ASCII only defines 7 of them). Under UCS-2 all characters take 2 bytes, and under UCS-4 all characters take 4 bytes. UTF-16 takes 2 bytes for most characters and 4 bytes for those outside the 16-bit range.
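
For illustration, here is a minimal sketch of how the byte count of one and the same character varies with the encoding. It is written in Python rather than Perl purely as a worked example. The sample characters and the helper name `byte_length` are my own choices, and Python's `utf-16-le` and `utf-32-le` codecs stand in for UCS-2 and UCS-4 (they agree for characters in the 16-bit range and for all characters, respectively).

```python
# Byte counts for single characters under several encodings.
# Sample characters (arbitrary choices for illustration):
#   U+0041 LATIN CAPITAL LETTER A
#   U+00E9 LATIN SMALL LETTER E WITH ACUTE
#   U+65E5 a CJK ideograph used in Japanese
#   U+1F600 an emoji outside the 16-bit range
samples = ["\u0041", "\u00e9", "\u65e5", "\U0001F600"]
encodings = ["ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"]


def byte_length(ch, encoding):
    """Return the encoded size in bytes, or None if the character
    cannot be represented in that encoding."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# table[character][encoding] -> number of bytes (or None)
table = {ch: {enc: byte_length(ch, enc) for enc in encodings} for ch in samples}

for ch, sizes in table.items():
    print(f"U+{ord(ch):04X}", sizes)
```

Note that the Japanese sample takes 3 bytes in UTF-8 but 2 bytes in UTF-16, so "a 2-byte character" only means something once the encoding is fixed.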


Re^2: generate character string based on byte count !!
by hv (Prior) on Dec 08, 2004 at 11:01 UTC

    More info: UTF-8 as implemented in perl (on ASCII platforms) requires a second byte for codepoints 0x80 and higher, a third byte at 0x800, a fourth at 0x10000, a fifth at 0x200000, a sixth at 0x4000000 and a seventh at 0x80000000.

    Note that this extends beyond the defined Unicode range, since we may store things other than Unicode characters in our strings - perl supports any integer that fits in a UV (32-bit or 64-bit unsigned integer, depending on your perl build) as a codepoint.

    If I understand the code correctly (Perl_uvuni_to_utf8_flags() in utf8.c), higher codepoints (available only where perl is compiled with 64-bit integer support) use 7 bytes up to 0x1000000000, and a fixed 13 bytes for the rest.
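
As a rough sketch of the thresholds described above (in Python, and not perl's actual code), the byte count for a given codepoint can be looked up from a small table. The function name `perl_utf8_length` is hypothetical, and treating 0x1000000000 as the first codepoint that needs 13 bytes is an assumption about how "up to" should be read.

```python
# (first codepoint needing this many bytes, byte count), highest first.
# Values are the thresholds quoted in the post; the 13-byte boundary
# being exclusive for the 7-byte form is an assumption.
THRESHOLDS = [
    (0x1000000000, 13),
    (0x80000000, 7),
    (0x4000000, 6),
    (0x200000, 5),
    (0x10000, 4),
    (0x800, 3),
    (0x80, 2),
]


def perl_utf8_length(codepoint):
    """Bytes needed for a codepoint under the extended UTF-8 scheme
    described above (hypothetical helper, for illustration only)."""
    if codepoint < 0:
        raise ValueError("codepoint must be non-negative")
    for start, nbytes in THRESHOLDS:
        if codepoint >= start:
            return nbytes
    return 1  # 0x00 through 0x7F


for cp in (0x7F, 0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000):
    print(hex(cp), perl_utf8_length(cp))
```

Within the defined Unicode range the results agree with standard UTF-8; only the 5-byte and longer forms are specific to the extended scheme.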

    Hugo

Re^2: generate character string based on byte count !!
by barathbr (Scribe) on Dec 09, 2004 at 10:27 UTC
    I guess I was not very clear in what I had written. In simple terms, what you are saying is correct, and as a matter of fact that's exactly what I want.

    Let's say I want to generate some random Japanese characters of 2 bytes each. Please note that I still don't know whether a Japanese character can be encoded in UTF-8 or UTF-16, or whatever the character set may be.

    Bottom line is, I don't really care what language the characters are generated in. I shouldn't have used the term 'length'. What I meant was that I want to generate a character string composed of characters of 2 bytes each, 4 bytes each, etc.
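
To make the request concrete, something along these lines is what I am after. This is only a sketch in Python, assuming UTF-8 as the encoding; the name `random_string` and the range table are made up for illustration, and the output may contain control characters or unassigned codepoints since I don't care which characters come out.

```python
import random

# Codepoint ranges by encoded length in standard UTF-8 (Unicode range only).
UTF8_RANGES = {
    1: (0x00, 0x7F),
    2: (0x80, 0x7FF),
    3: (0x800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}

# Surrogate codepoints are not valid characters and cannot be encoded.
SURROGATES = range(0xD800, 0xE000)


def random_string(bytes_per_char, count, rng=random):
    """Return `count` random characters, each taking exactly
    `bytes_per_char` bytes when encoded as UTF-8."""
    lo, hi = UTF8_RANGES[bytes_per_char]
    chars = []
    while len(chars) < count:
        cp = rng.randint(lo, hi)
        if cp in SURROGATES:
            continue  # skip and draw again
        chars.append(chr(cp))
    return "".join(chars)


s = random_string(2, 10)
print(len(s), len(s.encode("utf-8")))  # character count, byte count
```

With a different encoding the ranges would differ, so the table would have to be rebuilt for, say, UTF-16.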

    Hope that clarifies things a bit.

    BrowserUK, tall_man, thanks for the responses, but they don't quite serve my purpose. I hope this post adds a little more clarity to what I seek.

    Thanks everyone