in reply to Perl Modules for handling Non English text

Try http://perldoc.perl.org/perlunicode.html.

Standard ASCII handles characters in the western alphabet including umlaut characters (like in German). But there are only 256 possibilities in 8 bits! That's not enough for all languages and hence "wide characters", or 16 bit ones.

In general you will find that it is possible to make the user interface conform to national standards. But you will find that at the low level, western languages, English in particular, are the norm.

Perl is like any other computer language. You have to tell it how to interpret the byte stream - is each byte a character or is two bytes a character?


Re^2: Perl Modules for handling Non English text
by ikegami (Patriarch) on Mar 30, 2009 at 22:14 UTC

    That's not enough for all languages and hence "wide characters", or 16 bit ones.

    Perl's wide chars are 32-bit or 64-bit depending on the build, not 16.

    fmdev10$ perl -le'print ord "\x{FFFFFFFF}"'
    4294967295
    persephone$ perl -le'print ord "\x{FFFFFFFFFFFFFFFF}"'
    18446744073709551615

    Unicode currently requires 21 bits.

    is each byte a character or is two bytes a character?

    Or something else entirely, as in the following popular encodings: UTF-8 (1-4 bytes per char currently, 1-6 possible), UTF-16le/UTF-16be (2 or 4 bytes per char).

      My answer was based upon ANSI C:
      GCC on my Intel machine:
      #include <stdio.h>
      #include <stddef.h>

      int main () {
          printf ("hello world\n");
          printf ("size of a wide char is %d bytes", (int) sizeof(wchar_t));
          return (0);
      }

      /* prints:
         hello world
         size of a wide char is 2 bytes */
      $ perl -le 'print ord "\x{FFFFFFFF}"'
      4294967295
      is just a 32 bit unsigned hex number.

      I don't know how many bits Hindi requires.

      Update: http://ascii-table.com/unicode.php shows the Unicode standard. This is complex. But basically 16 bits does it.

        My answer was based upon ANSI C:

        In this Perl discussion, that's as relevant as Java using 16-bit chars. I can understand the mistake of bringing it up initially, but why bring it up again?

        And it's wrong. ANSI C says nothing about wchar_t being 16-bit. sizeof(wchar_t) can be as small as 1, and it's commonly 4. In fact, your own program betrays you. Also from gcc on an Intel:

        $ gcc -o a a.c
        $ a
        hello world
        size of a wide char is 4 bytes

        4294967295 is just a 32 bit unsigned hex number.

        And how did I get that number? By getting the character number of "\x{FFFFFFFF}". Therefore, I had a 32-bit character.

        But basically 16 bits does it.

        Your own reference contradicts you. 17 planes of 16 bits = way more than 16 bits. (21, to be precise.)

        For example, these Chinese chars require more than 16 bits.

        I don't know how many bits Hindi requires.

        It varies by encoding, and it can even vary within an encoding. But it's completely irrelevant. Perl supports all Unicode characters, including the Hindi ones.

Re^2: Perl Modules for handling Non English text
by DrHyde (Prior) on Mar 31, 2009 at 10:21 UTC
    Standard ASCII handles characters in the western alphabet including umlaut characters (like in German). But there are only 256 possibilities in 8 bits!
    ASCII is only 7 bits and does not include any accented characters at all. You are, perhaps, confusing it with ISO-8859-n.
      I think he's referring to the "A" Windows system calls, where the "A" stands for ANSI (not ASCII) despite having very little to do with the ANSI character encodings (or the ASCII character encoding).