in reply to Re^7: Perl Modules for handling Non English text
in thread Perl Modules for handling Non English text
Perl stores strings internally as either iso-8859-1 or UTF-8, so:
Characters U+000000 to U+00007F take one byte.
Characters U+000080 to U+0000FF take one (iso-8859-1) or two (UTF-8) bytes.
Characters U+000100 to U+0007FF take two bytes.
Characters U+000800 to U+00FFFF take three bytes.
Characters U+010000 to U+10FFFF take four bytes.
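The UTF-8 sizes listed above can be checked by encoding one code point from each range and counting the resulting bytes. This is only a rough sketch: `utf8_len` is a made-up helper name, the sample code points are arbitrary picks, and it measures the UTF-8 encoding of a copy rather than peeking at the internal buffer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): number of bytes
# a single code point occupies once encoded as UTF-8.
sub utf8_len {
    my ($cp) = @_;
    my $s = chr($cp);
    utf8::encode($s);      # in place: characters -> UTF-8 bytes
    return length($s);     # length now counts bytes
}

# One arbitrary sample from each range in the list.
for my $cp ( 0x41, 0xE9, 0x7FF, 0x900, 0xFFFD, 0x10000 ) {
    printf "U+%06X: %d byte(s)\n", $cp, utf8_len($cp);
}
```

Note that U+0000E9 reports two bytes here because the helper always encodes to UTF-8; the one-byte case only applies when Perl keeps the string in its iso-8859-1 form.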
Wikipedia indicates Hindi uses the Devanagari script (U+000900..U+00097F). All Indic scripts are at or above U+000900, so each of their characters takes at least three bytes.
    use Encode qw( _utf8_off );
    my $x = chr(0x900);
    _utf8_off( $x );            # Get internal representation.
    print(length($x), "\n");    # 3
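If you would rather not flip the UTF8 flag on a live string, a similar check can be sketched with `bytes::length` and `utf8::upgrade`. The variable names and the sample Hindi word are my own picks for illustration, and `bytes::length` is meant for inspection like this, not for general string handling.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

# A character in U+0080..U+00FF: one byte while stored as iso-8859-1.
my $e = chr(0xE9);
my $e_before = bytes::length($e);    # 1

utf8::upgrade($e);                   # switch the internal format to UTF-8
my $e_after = bytes::length($e);     # 2
my $e_chars = length($e);            # 1, still a single character

# Six Devanagari code points (sample word chosen for illustration).
my $hindi = "\x{928}\x{92E}\x{938}\x{94D}\x{924}\x{947}";
my $hindi_chars = length($hindi);          # 6
my $hindi_bytes = bytes::length($hindi);   # 18, three bytes each

print "$e_before $e_after $e_chars $hindi_chars $hindi_bytes\n";
```

The string is left untouched here apart from the explicit upgrade, which changes only the storage format and not the characters it contains.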