barathbr has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks, I am in need of some assistance with regard to the byte lengths of characters.

I need to prepare a list of characters which are 1 byte, 2 bytes or 4 bytes long. These in turn are supposed to be fed into a different program for validation.

My question is: is there a straightforward way of printing a list of characters by just passing arguments to a script, such as "char length = 1", "char length = 2", "char length = 4"? If so, could somebody provide me some guidance on how one would go about generating a list of characters based on that?

Any pointers to relevant modules would also be welcome. Thanks in advance.

Replies are listed 'Best First'.
Re: generate character string based on byte count !!
by Joost (Canon) on Dec 08, 2004 at 08:59 UTC
    I don't know what you mean. "Characters" don't have a length. The actual number of bytes taken by a character in a string depends on the coded character set (unicode, latin-1, ascii...) and the encoding (for unicode, these include utf-8, utf-16, ucs-2 and ucs-4).

    Under utf-8, the first 128 code points (0 to 127) take up 1 byte, and higher-numbered characters take a variable number of bytes (I'm not sure about the exact encoding, but IIRC it can take up to 4 bytes under the current unicode set). Under ascii and latin-1 all characters are encoded using 1 byte (8 bits). Under ucs-2 all characters take 2 bytes, and under ucs-4 all characters take 4 bytes.
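To make the point about encodings concrete, here is a minimal sketch that uses the core Encode module to count the bytes one character occupies in a few encodings. The helper name `encoded_length` and the sample characters are chosen for illustration only; they are not from the posts above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes a single character occupies in the named encoding.
# (encoded_length is a hypothetical helper, not part of Encode.)
sub encoded_length {
    my ($encoding, $char) = @_;
    return length encode($encoding, $char);
}

# Sample characters: 'A', e-acute, HIRAGANA LETTER A, and an emoji.
my @chars     = ("\x{41}", "\x{E9}", "\x{3042}", "\x{1F600}");
my @encodings = qw(UTF-8 UTF-16BE UTF-32BE);

for my $char (@chars) {
    printf "U+%04X", ord $char;
    printf "  %s=%d", $_, encoded_length($_, $char) for @encodings;
    print "\n";
}
```

The same character comes out at different byte counts depending on the encoding: U+3042 takes 3 bytes in UTF-8, 2 in UTF-16BE and 4 in UTF-32BE, so "a 2-byte character" only means something once the encoding is fixed.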

      More info: UTF-8 as implemented in perl (on ASCII platforms) requires a second byte for codepoints 0x80 and higher, a third byte at 0x800, a fourth at 0x10000, a fifth at 0x200000, a sixth at 0x4000000 and a seventh at 0x80000000.

      Note that this extends beyond the defined Unicode range, since we may store things other than Unicode characters in our strings - perl supports any integer that fits in a UV (32-bit or 64-bit unsigned integer, depending on your perl build) as a codepoint.

      If I understand the code correctly (Perl_uvuni_to_utf8_flags() in utf8.c), higher codepoints (available only where perl is compiled with 64-bit integer support) use 7 bytes up to 0x1000000000, and a fixed 13 bytes for the rest.

      Hugo
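The first few thresholds listed above can be checked with a short sketch. It relies on the built-in `utf8::encode`, which converts a string of characters into UTF-8 octets in place. The helper name `utf8_byte_length` is made up for this example, and the probe values are kept inside the Unicode range, so the 5-byte and longer forms are not exercised here.

```perl
use strict;
use warnings;

# Bytes needed to store one code point in UTF-8.
# (utf8_byte_length is a hypothetical helper name.)
sub utf8_byte_length {
    my ($codepoint) = @_;
    my $char = chr $codepoint;
    utf8::encode($char);    # in place: characters -> UTF-8 octets
    return length $char;
}

# Probe the code points on either side of each threshold.
for my $cp (0x7F, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFF) {
    printf "0x%06X needs %d byte(s)\n", $cp, utf8_byte_length($cp);
}
```

Each step in the reported length should line up with the 0x80, 0x800 and 0x10000 boundaries described in the post.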

      I guess I was not very clear in what I had written. In simple terms, what you are saying is correct, and as a matter of fact that's exactly what I want.

      Let's say I want to generate some random Japanese characters which are 2 bytes each. Please note that I still don't know whether you can encode a Japanese character in utf8 or utf16 or whatever the character set may be.

      Bottom line is, I don't really care what language the characters get generated in. I shouldn't have used the term 'length'. What I meant was that I want to generate a character string composed of characters of 2 bytes each, 4 bytes each, etc.

      Hope that clarifies things a bit.

      BrowserUk, tall_man, thanks for the responses, but they don't quite serve my purpose. I hope this post adds a little more clarity to what I seek.

      Thanks everyone
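One possible reading of the request is "give me N random characters that each take exactly K bytes in UTF-8". The sketch below follows that reading only; for UTF-16 or a legacy Japanese encoding the ranges would be different. The ranges in `%range_for`, the name `random_string` and the script name in the usage line are all assumptions made for illustration, and many code points inside these ranges are unassigned, which may or may not suit the validating program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative ranges (an assumption, not a standard table): every code point
# in range N encodes to exactly N bytes in UTF-8.
my %range_for = (
    1 => [ 0x20,    0x7E    ],  # printable ASCII
    2 => [ 0xA0,    0x7FF   ],  # skips the C1 control characters
    3 => [ 0x800,   0xD7FF  ],  # stops before the surrogate block
    4 => [ 0x10000, 0x1FFFD ],  # plane 1 only, minus its two noncharacters
);

# Return $count random characters, each $bytes_per_char bytes long in UTF-8.
sub random_string {
    my ($bytes_per_char, $count) = @_;
    my $range = $range_for{$bytes_per_char}
        or die "No range defined for $bytes_per_char-byte characters\n";
    my ($lo, $hi) = @$range;
    return join '', map { chr( $lo + int rand( $hi - $lo + 1 ) ) } 1 .. $count;
}

# Hypothetical usage: perl genchars.pl BYTES_PER_CHAR COUNT
if (@ARGV == 2) {
    binmode STDOUT, ':encoding(UTF-8)';
    print random_string(@ARGV), "\n";
}
```

Run as, say, `perl genchars.pl 2 10`, this should print ten characters totalling twenty bytes of UTF-8 (plus the newline).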
Re: generate character string based on byte count !!
by BrowserUk (Patriarch) on Dec 08, 2004 at 12:37 UTC

    Are you looking for something like this?

    #! perl -slw
    use strict;

    ## Adjust to suit your requirements
    my %types = (
        lower => [ 'a'..'z' ],
        upper => [ 'A'..'Z' ],
        number=> [ '0'..'9' ],
        char  => [ 'a'..'z', 'A'..'Z', '0'..'9' ],
    );

    print join ' ', map{
        my( $type, $n ) = $_ =~ m[(\w+) length = (\d+)];
        join'', map{
            $types{ $type }[ rand @{ $types{ $type } } ]
        } 1 .. $n;
    } @ARGV;

    __END__
    P:\test>413131 "char length = 4" "number length = 3" "lower length = 6"
    tAwY 828 xkppno

    [12:35:08.56] P:\test>413131 "char length = 1" "number length = 10" "upper length = 2"
    j 3625181636 OB

Re: generate character string based on byte count !!
by Anonymous Monk on Dec 08, 2004 at 11:21 UTC
    You get the byte length of characters when you use the bytes pragma. How about this (untested):
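    print "char#\tlength1\tlength2\n";
    for (32..500) {
        my $char = chr;
        # Force perl's internal UTF-8 form; otherwise 128..255 are stored
        # as single bytes and would report a byte length of 1.
        utf8::upgrade($char);
        # "use bytes" is lexically scoped, so it only affects this block.
        my $bytes = do { use bytes; length $char };
        print "$_\t", length($char), "\t$bytes\n";
    }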
Re: generate character string based on byte count !!
by tall_man (Parson) on Dec 08, 2004 at 16:40 UTC
    Are we making this too complicated? If all you want is to create a character string of a given length, perhaps what you need is just the "x" (repetition) operator applied to a simple one-byte ascii character.

    #!/usr/bin/perl -w
    use strict;

    my $len = 0;
    if (@ARGV >= 1) {
        $len = $ARGV[0];
    }
    $len > 0 || die "Usage: genlen.pl number\n";
    print "a" x $len;