I don't know what you mean.
"Characters" don't have a length. The number of bytes a character occupies in a string depends on the coded character set (Unicode, Latin-1, ASCII...) and on the encoding (for Unicode, these include UTF-8, UTF-16, UCS-2 and UCS-4).
Under UTF-8, the first 128 characters (code points 0 to 127) take up 1 byte, and higher-numbered characters take a variable number of bytes: 2 bytes up to 0x7FF, 3 bytes up to 0xFFFF, and 4 bytes for the rest of the current Unicode range. Under ASCII and Latin-1 all characters are encoded using 1 byte (ASCII uses only 7 of the 8 bits). Under UCS-2 all characters take 2 bytes, and under UCS-4 all characters take 4 bytes.
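To make the per-encoding sizes concrete, here is a minimal sketch using the core Encode module. The helper name encoded_len and the four sample code points are just illustrative choices, not anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes one code point takes in a given encoding.
sub encoded_len {
    my ($encoding, $cp) = @_;
    return length( encode( $encoding, chr $cp ) );
}

# Sample code points: 'A', e-acute, hiragana 'a', and an emoji.
for my $cp ( 0x41, 0xE9, 0x3042, 0x1F600 ) {
    printf "U+%04X  UTF-8: %d  UTF-16BE: %d  UTF-32BE: %d\n", $cp,
        map { encoded_len( $_, $cp ) } qw(UTF-8 UTF-16BE UTF-32BE);
}
```

Note that the hiragana character takes 3 bytes in UTF-8 but 2 bytes in UTF-16, so "a 2-byte character" only means something once the encoding is fixed.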
More info: UTF-8 as implemented in perl (on ASCII platforms) requires a second byte for codepoints 0x80 and higher, a third byte at 0x800, a fourth at 0x10000, a fifth at 0x200000, a sixth at 0x4000000 and a seventh at 0x80000000.
Note that this extends beyond the defined Unicode range, since we may store things other than Unicode characters in our strings - perl supports any integer that fits in a UV (32-bit or 64-bit unsigned integer, depending on your perl build) as a codepoint.
If I understand the code correctly (Perl_uvuni_to_utf8_flags() in utf8.c), higher codepoints (available only where perl is compiled with 64-bit integer support) use 7 bytes up to 0x1000000000, and a fixed 13 bytes for the rest.
Hugo
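The thresholds above can be written down as a small lookup. This is only a sketch of the numbers quoted in this post, not a copy of the code in utf8.c, and perl_utf8_len is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: bytes perl's extended UTF-8 needs for a codepoint,
# using only the thresholds quoted above.
sub perl_utf8_len {
    my ($cp) = @_;
    return  1 if $cp < 0x80;
    return  2 if $cp < 0x800;
    return  3 if $cp < 0x10000;
    return  4 if $cp < 0x200000;
    return  5 if $cp < 0x4000000;
    return  6 if $cp < 0x80000000;
    return  7 if $cp < 2**36;    # 0x1000000000; only reachable on 64-bit builds
    return 13;
}

# Show the byte count on either side of the first few boundaries.
printf "0x%X => %d\n", $_, perl_utf8_len($_)
    for 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000;
```

Everything above 0x10FFFF is outside the defined Unicode range, so the 5-byte and longer forms only matter if you store non-Unicode integers in your strings.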
I guess I was not very clear in what I wrote. In simple terms, what you are saying is correct and, as a matter of fact, that's exactly what I want.
Let's say I want to generate some random Japanese characters which take 2 bytes each. Please note that I still don't know whether a Japanese character can be encoded in UTF-8 or UTF-16, or whatever the character set may be.
Bottom line is, I don't really care what language the characters are generated in. I shouldn't have used the term 'length'. What I meant was that I want to generate a character string composed of characters of 2 bytes each, 4 bytes each, etc.
Hope that clarifies things a bit.
BrowserUK, tall_man, thanks for the responses, but they don't quite solve my problem. I hope this post adds a little more clarity to what I seek.
Thanks, everyone
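One possible reading of this request, assuming the bytes are counted in UTF-8, is to pick random codepoints from a block whose characters all encode to the wanted size. The sketch below does that; the %range table and the random_chars helper are hypothetical, and the blocks (printable ASCII, Cyrillic, hiragana, emoticons) are arbitrary picks that happen to have the right encoded size.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed mapping: UTF-8 bytes per character => [first, last] codepoint of a
# block in which every character encodes to exactly that many bytes.
my %range = (
    1 => [ 0x21,    0x7E    ],    # printable ASCII
    2 => [ 0x0410,  0x044F  ],    # Cyrillic letters
    3 => [ 0x3041,  0x3096  ],    # hiragana
    4 => [ 0x1F600, 0x1F64F ],    # emoticons
);

# Hypothetical helper: $count random characters of $bytes_each UTF-8 bytes each.
sub random_chars {
    my ( $bytes_each, $count ) = @_;
    my $r = $range{$bytes_each}
        or die "no range for $bytes_each-byte characters\n";
    my ( $lo, $hi ) = @$r;
    return join '', map { chr( $lo + int rand( $hi - $lo + 1 ) ) } 1 .. $count;
}

my $str = random_chars( 3, 5 );
printf "%d characters, %d bytes as UTF-8\n",
    length($str), length( encode( 'UTF-8', $str ) );
```

Under UTF-16 the same table would not hold: the hiragana block takes 2 bytes per character there and the emoticons take 4, so the ranges have to be chosen per encoding.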
#! perl -slw
use strict;
## Adjust to suit your requirements
my %types = (
    lower  => [ 'a'..'z' ],
    upper  => [ 'A'..'Z' ],
    number => [ '0'..'9' ],
    char   => [ 'a'..'z', 'A'..'Z', '0'..'9' ],
);
print join ' ', map{
    my( $type, $n ) = $_ =~ m[(\w+) length = (\d+)];
    join'', map{
        $types{ $type }[ rand @{ $types{ $type } } ]
    } 1 .. $n;
} @ARGV;
__END__
P:\test>413131 "char length = 4" "number length = 3" "lower length = 6"
tAwY 828 xkppno
[12:35:08.56] P:\test>413131 "char length = 1" "number length = 10" "upper length = 2"
j 3625181636 OB
Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
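You get the byte length of characters (in perl's internal representation) when you use the bytes pragma. How about this (untested):
print "char#\tlength1\tlength2\n";
for (32..500) {
    my $c = chr;           # build the character outside 'use bytes', where chr keeps only the low 8 bits
    print;
    print "\t";
    print length $c;       # length in characters
    print "\t";
    {
        use bytes;         # lexically scoped, so no 'no bytes' is needed
        print length $c;   # length in bytes; 128..255 report 1 because perl stores them unencoded by default
    }
    print "\n";
}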
Are we making this too complicated? If all you want is to create a character string of a given length, perhaps what you need is just the "x" (repetition) operator applied to a simple one-byte ASCII character.
#!/usr/bin/perl -w
use strict;
my $len = 0;
if (@ARGV >= 1) {
    $len = $ARGV[0];
}
$len > 0 || die "Usage: genlen.pl number\n";
print "a" x $len;