in reply to Re^4: Perl Modules for handling Non English text
in thread Perl Modules for handling Non English text

I think the most commonly used encodings are what matter here. Most languages can be represented in 8 bits per character, and the vast majority of those that can't (Hindi is one of them) can be represented in 16 bits. Yes, I agree that a full representation requires 4 bytes and that Perl can do it. That is not in question!

At the end of the day, I normally work with databases generated by other software that can't handle 32-bit characters. Maybe you don't have that limitation, but I do.

The original question was how to handle Hindi and the answer is that Perl does fine and "C" does fine with that as this only requires 16 bits.

Re^6: Perl Modules for handling Non English text
by ikegami (Patriarch) on Mar 31, 2009 at 04:53 UTC

    The original question was how to handle Hindi and the answer is that Perl does fine and "C" does fine with that as this only requires 16 bits.

    That's wrong. Perl can do Hindi because Perl can do Unicode. The number of bits has nothing to do with it. For example, there are 8-bit character sets Perl cannot handle, because some of their characters aren't in Unicode.

    And you're wrong about Hindi characters requiring 16 bits. It depends on the encoding.

    • If you're talking about Perl's internal representation, Hindi characters take 3 bytes.
    • If you're talking about Perl's external representation, Perl uses a 32-bit character set.
      Hindi encodes into something that both C and Perl can understand and is <= 16 bits. I will talk with a friend of mine who is a native speaker and get back to you. But I am very confident about this. Full representation of oriental languages can take more bits... no question at all!

        Perl uses iso-8859-1 and UTF-8, so
        Characters U+000000 to U+00007F take one byte.
        Characters U+000080 to U+0000FF take one (iso-8859-1) or two (UTF-8) bytes.
        Characters U+000100 to U+0007FF take two bytes.
        Characters U+000800 to U+00FFFF take three bytes.
        Characters U+010000 to U+10FFFF take four bytes.

        Wikipedia indicates Hindi uses Devanagari script (U+000900..U+00097F). All Indic scripts are at or above U+000900.

        use Encode qw( _utf8_off );

        my $x = chr(0x900);
        _utf8_off( $x );              # Get internal representation.
        print(length($x), "\n");      # 3
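
The byte counts in the table above can also be checked without touching Perl's internals, by encoding each character to UTF-8 explicitly and measuring the result. This is only a minimal sketch: `utf8_len` is a hypothetical helper name, not something from the thread or from Encode, and the sample code points are simply boundary values picked from the ranges listed above.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not part of Encode): the number of bytes
# needed to store a single code point as UTF-8.
sub utf8_len {
    my ($code_point) = @_;
    return length( encode( 'UTF-8', chr($code_point) ) );
}

# Sample code points chosen from the ranges in the table,
# including both ends of the Devanagari block.
for my $cp ( 0x7F, 0x80, 0x7FF, 0x800, 0x900, 0x97F, 0x10000 ) {
    printf "U+%06X takes %d byte(s) in UTF-8\n", $cp, utf8_len($cp);
}
```

For the Devanagari block (U+000900..U+00097F) this should report three bytes per character, matching the `_utf8_off` result above.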