Re^8: Perl Modules for handling Non English text

Perl stores strings internally as either iso-8859-1 or UTF-8, so:
Characters U+000000 to U+00007F take one byte.
Characters U+000080 to U+0000FF take one (iso-8859-1) or two (UTF-8) bytes.
Characters U+000100 to U+0007FF take two bytes.
Characters U+000800 to U+00FFFF take three bytes.
Characters U+010000 to U+10FFFF take four bytes.
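
The UTF-8 side of that table can be checked with a short sketch. It measures the encoded form rather than Perl's internal buffer, and `utf8_len` is a helper name made up for this illustration; `encode` is the regular Encode function.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: number of bytes in the UTF-8 encoding of one code point.
sub utf8_len {
    my ($cp) = @_;
    return length( encode( 'UTF-8', chr($cp) ) );
}

# Code points on either side of each boundary, plus the start of Devanagari.
for my $cp ( 0x7F, 0x80, 0x7FF, 0x800, 0x900, 0xFFFD, 0x10000, 0x10FFFD ) {
    printf "U+%06X: %d byte(s)\n", $cp, utf8_len($cp);
}
```

U+00FFFD and U+10FFFD stand in for the top of each range because the very last code points (U+00FFFF, U+10FFFF) are noncharacters, which some Encode versions treat specially.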

Wikipedia indicates Hindi uses the Devanagari script (U+000900..U+00097F). All Indic scripts are at or above U+000900.

use Encode qw( _utf8_off );
my $x = chr(0x900);
_utf8_off( $x );          # Get internal representation.
print(length($x), "\n");  # 3

Re^9: Perl Modules for handling Non English text by Marshall (Canon) on Mar 31, 2009 at 06:24 UTC
Your hex math looks fine. I am just saying that Hindi, like most languages, can be described with <= 16 bits per character. Languages like Chinese or Japanese are more complex. As to how these languages can be expressed as a sequence of simpler things... think about Morse Code: there are a limited number of symbols that can be generated and understood. I do not doubt your book, but I do doubt the practice. I'm sure there are some Europeans and Asians who can speak better on this topic than I can.
Re^10: Perl Modules for handling Non English text by ikegami (Patriarch) on Mar 31, 2009 at 06:50 UTC
> Your hex math looks fine. I am just saying that Hindi, like most languages, can be described with <= 16 bits per character.

And I'm "just saying" no, not in a way Perl understands them. The question was about what Perl understands. Mind you, `decode` and other means can be used to decode text in any number of encodings into something Perl understands. But only once it's decoded does Perl understand the text to be Hindi characters. And once they're decoded, the Hindi characters happen to take at least three bytes. There's nothing special about 16 bits, so I don't know why you keep bringing it up. You hold high the ability of the characters to be represented by UCS-2le (passing it off as the only encoding), but that has nothing to do with the OP's question or Perl's abilities.
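
A minimal sketch of that decoding step, assuming the input arrives as UTF-16LE (which matches UCS-2LE for characters at or below U+00FFFF). The two input bytes are an illustrative sample chosen for this example, not data from the thread.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Sample input (an assumption for illustration): U+0915 DEVANAGARI LETTER KA
# as two little-endian bytes, as a UTF-16LE source would supply it.
my $bytes = "\x15\x09";

# Until decoded, Perl sees two unrelated bytes; after decoding, one character.
my $chars = decode( 'UTF-16LE', $bytes );

printf "input bytes:    %d\n", length($bytes);
printf "decoded chars:  %d\n", length($chars);
printf "code point:     U+%04X\n", ord($chars);
printf "as UTF-8 bytes: %d\n", length( encode( 'UTF-8', $chars ) );
```

The same character occupies two bytes in the 16-bit encoding and three in UTF-8, which is the distinction being argued here.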
Re^11: Perl Modules for handling Non English text by Marshall (Canon) on Mar 31, 2009 at 07:14 UTC
Perl is amazingly flexible and there is no question about that at all! I think that there is a big jump to 16-bit characters from 8-bit ones. And I believe that this "jump" involves the source files and input format. There is a bigger jump past that to 32-bit characters. We don't disagree about the "power of Perl". But I say that there are complications with source database files.
Re^12: Perl Modules for handling Non English text by ikegami (Patriarch) on Mar 31, 2009 at 07:17 UTC
Re^13: Perl Modules for handling Non English text by Marshall (Canon) on Apr 02, 2009 at 07:52 UTC