Re^3: Perl Modules for handling Non English text

My answer was based upon ANSI C:
GCC on my Intel machine:

#include <stdio.h>
#include <stddef.h>
int main ()
{
   printf ("hello world\n");
   printf ("size of a wide char is %d bytes", (int) sizeof(wchar_t) );
   return (0);
}
/* prints: 
 hello world
 size of a wide char is 2 bytes
*/

$perl -le 'print ord "\x{FFFFFFFF}"'
4294967295
is just a 32 bit unsigned hex number.

I don't know how many bits Hindi requires.

Update: http://ascii-table.com/unicode.php shows the Unicode standards. This is complex. But basically 16 bits does it.


Re^4: Perl Modules for handling Non English text by ikegami (Patriarch) on Mar 31, 2009 at 02:26 UTC
"My answer was based upon ANSI C:"

In this Perl discussion, that's as relevant as Java using 32-bit wide chars. I can understand the mistake of bringing it up initially, but why bring it up again?

And it's wrong. ANSI C says nothing about `wchar_t` being 16-bit. `sizeof(wchar_t)` can be as small as 1, and it's commonly 4. In fact, your own program betrays you. Also from `gcc` on an Intel:

$ gcc -o a a.c
$ a
hello world
size of a wide char is 4 bytes

"4294967295 is just a 32 bit unsigned hex number."

And how did I get that number? By getting the character number of `"\x{FFFFFFFF}"`. Therefore, I had a 32-bit character.

"But basically 16 bits does it."

Your own reference contradicts you. 17 planes of 16 bits = way more than 16 bits. (21, to be precise.) For example, these Chinese chars require more than 16 bits.

"I don't know how many bits Hindi requires."

It varies by encoding, and it can even vary within an encoding. But it's completely irrelevant. Perl supports all Unicode characters, including the Hindi ones.
Re^5: Perl Modules for handling Non English text by Marshall (Canon) on Mar 31, 2009 at 04:15 UTC
I think the most commonly used standards are relevant. Most languages can be represented in 8 bits and the vast majority of those that can't be (and Hindi is one of them) can be represented in 16 bits. Yes, I agree that a full representation requires 4 bytes and that Perl can do it. That is not in question! At the "end of the day", I normally work with databases generated by other software that can't do 32-bit characters. Maybe you don't have that limitation, but I do. The original question was how to handle Hindi and the answer is that Perl does fine and "C" does fine with that as this only requires 16 bits.
Re^6: Perl Modules for handling Non English text by ikegami (Patriarch) on Mar 31, 2009 at 04:53 UTC
"The original question was how to handle Hindi and the answer is that Perl does fine and "C" does fine with that as this only requires 16 bits."

That's wrong. Perl can do Hindi because Perl can do Unicode. The number of bits has nothing to do with it. There are 8-bit characters Perl cannot handle, for example, because the characters from that character set aren't in Unicode.

And you're wrong about Hindi characters requiring 16 bits. It depends on the encoding. If you're talking about Perl's internal representation, Hindi characters take 3 or 4 bytes. If you're talking about Perl's external representation, Perl uses a 32-bit character set.
Re^7: Perl Modules for handling Non English text by Marshall (Canon) on Mar 31, 2009 at 05:08 UTC
Re^8: Perl Modules for handling Non English text by ikegami (Patriarch) on Mar 31, 2009 at 05:17 UTC