in reply to Re^9: Perl Modules for handling Non English text
in thread Perl Modules for handling Non English text

Your hex math looks fine. I am just saying that Hindi, like most languages, can be described with <= 16 bits per character.

And I'm "just saying" no, not in a way Perl understands them. The question was about what Perl understands.

Mind you, decode and other means can be used to decode text in any number of encodings into something Perl understands. But only once it's decoded does Perl understand the text to be Hindi characters. And once they're decoded, the Hindi characters happen to take at least three bytes.
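
A minimal sketch of that decode step, assuming the input arrives as UTF-16LE bytes (for this character, the same bytes as UCS-2le) and using the core Encode module. The sample letter (U+0915, DEVANAGARI LETTER KA) and the variable names are only illustrative. Perl's internal storage is an implementation detail, so the UTF-8 encoded length is printed here as a stand-in for the "three bytes" figure:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# DEVANAGARI LETTER KA (U+0915) as raw UTF-16LE bytes: low byte first.
my $bytes = "\x15\x09";

# decode() turns encoded bytes into a string of Perl characters.
my $chars = decode('UTF-16LE', $bytes);

printf "bytes in:    %d\n", length $bytes;     # 2
printf "characters:  %d\n", length $chars;     # 1
printf "code point:  U+%04X\n", ord $chars;    # U+0915

# Encoded as UTF-8, the same single character occupies three bytes.
my $utf8 = encode('UTF-8', $chars);
printf "UTF-8 bytes: %d\n", length $utf8;      # 3
```

Until decode is called, Perl sees only two unrelated bytes (0x15 and 0x09); afterwards it sees one character.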

There's nothing special about 16 bits, so I don't know why you keep bringing it up. You hold high the ability of the characters to be represented by UCS-2le (passing it off as the only encoding), but that has nothing to do with the OP's question or Perl's abilities.

Re^11: Perl Modules for handling Non English text
by Marshall (Canon) on Mar 31, 2009 at 07:14 UTC
    Perl is amazingly flexible and there is no question about that at all!

    I think that there is a big jump to 16-bit chars from 8-bit ones. And I believe that this "jump" involves the source files and input format. There is a bigger jump past that to 32-bit characters.

    We don't disagree about the "power of Perl". But I say that there are complications with source database files.

      I never disputed the possibility of problems with format conversions. The topic never even came up.

      And I believe that this "jump" involves the source files and input format.

      I'm not sure what you mean by that.

      Note that Perl understands UTF-8 source files if you put use utf8; in them. (Since it seems to matter to you, the Hindi characters would be 3 bytes or more in such files.)
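
      A minimal sketch of the use utf8; case, assuming the script file itself is saved as UTF-8. The sample letter and variable names are only illustrative:

      ```perl
      use strict;
      use warnings;
      use utf8;    # tell Perl that this source file is UTF-8

      my $ka = "क";    # DEVANAGARI LETTER KA, U+0915

      # With use utf8; the literal is one character, not three separate bytes.
      printf "characters:  %d\n", length $ka;      # 1

      # Encode a copy in place to count how many bytes it takes in the file.
      my $copy = $ka;
      utf8::encode($copy);
      printf "UTF-8 bytes: %d\n", length $copy;    # 3
      ```

      Without use utf8; the same literal would be read as three separate bytes and length $ka would report 3.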

      But I say that there are complications with source database files.

      I'm not sure what you mean by this either. But "source" means reading, right? If you were able to store the characters in it in the first place, there's no reason you wouldn't be able to read them back.

        Obviously non-western languages are not my area of expertise! I did learn a few things by investigating this. 7 bits is right for standard ASCII. I was looking at another kind of table, for some kind of HTML encoding where all 8 bits are used. Turns out this Hindi thing also appears to be a non-issue. According to my friend who speaks Hindi, most of these guys buy the English version. Anyway, I learned a few things and I'm happy about that. Sometimes investigating a question leads down some strange paths.