The Ninja K has asked for the wisdom of the Perl Monks concerning the following question:
Standard operating procedure for tearing apart a Unicode string is @bytes = unpack("U*",$string); but what it doesn't do (at least for me) is tear the string apart by character; it tears it apart by bytes.

    our $input = "日本語は少しだけ分かります。 and sam I am!";
    # Meaning of the Japanese: there is a little bit of understanding
    # of the Japanese language [at least].

    my @bytes = unpack("U*", $input);
    my $i = 0;    # stays 0: every pass splices from the front of @bytes
    while (scalar(@bytes) > 0) {
        # Guess the length of the UTF-8 sequence from its lead byte.
        my $byt = 1;
        $byt = 2 if ($bytes[$i] >= 192);
        $byt = 3 if ($bytes[$i] >= 224);
        $byt = 4 if ($bytes[$i] >= 240);
        $byt = 5 if ($bytes[$i] >= 248);
        print "$bytes[$i]: ";
        my @spl = splice(@bytes, 0, $byt);
        my $letter = pack("U*", @spl);
        print $letter . " [0x";
        foreach (@spl) { printf "%2.2X", $_; }
        print "] ";
        print "\n";
    }

Which will tear the array apart and output the characters, not the bytes that comprise a word. (P.S. I haven't slept in many moons... if someone can drop me a note on making those ifs better before I wake up tomorrow, that'd be grand :))
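On the request for tidier ifs, here is a minimal sketch. It assumes @bytes really holds raw UTF-8 bytes, which is only the case when the string has not been decoded into characters. The names utf8_seq_len and group_utf8_bytes are hypothetical and not from the post. It stops at four-byte sequences, since UTF-8 as currently standardized has no five-byte form, and it consumes a stray continuation byte on its own, which is a simplification rather than real validation.

```perl
use strict;
use warnings;

# Hypothetical helper: given the first byte of a UTF-8 sequence,
# return how many bytes that sequence occupies.
sub utf8_seq_len {
    my ($lead) = @_;
    return 1 if $lead < 0x80;                     # 0xxxxxxx: ASCII
    return 2 if $lead >= 0xC0 && $lead < 0xE0;    # 110xxxxx
    return 3 if $lead >= 0xE0 && $lead < 0xF0;    # 1110xxxx
    return 4 if $lead >= 0xF0 && $lead < 0xF8;    # 11110xxx
    return 1;    # continuation or invalid byte: consume it alone (assumption)
}

# Hypothetical helper: group raw UTF-8 bytes into one array ref per character.
sub group_utf8_bytes {
    my @bytes = @_;
    my @groups;
    while (@bytes) {
        push @groups, [ splice(@bytes, 0, utf8_seq_len($bytes[0])) ];
    }
    return @groups;
}

# The bytes E6 97 A5 encode one kanji; 61 is the letter "a".
my @groups = group_utf8_bytes(0xE6, 0x97, 0xA5, 0x61);
for my $group (@groups) {
    print join(' ', map { sprintf('%02X', $_) } @$group), "\n";
}
```

If the string has already been decoded, unpack("U*") returns code points rather than bytes, and these lead-byte thresholds no longer apply.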
With use utf8; the output is:

    Wide character in print at text_kanji.pl line 24.
    26085: 日本語は少 [0x65E5672C8A9E306F5C11]
    Wide character in print at text_kanji.pl line 24.
    12375: しだけ分か [0x3057306030515206304B]
    Wide character in print at text_kanji.pl line 24.
    12426: ります。  [0x308A307E3059300220]
    97: a [0x61]
    110: n [0x6E]
    100: d [0x64]
    32:   [0x20]
    115: s [0x73]
    97: a [0x61]
    109: m [0x6D]
    <!--snip-->

but if I don't use utf8; (not "no utf8", it is just not mentioned at all) I get the correct output...

    230: 日 [0xE697A5]
    230: 本 [0xE69CAC]
    232: 語 [0xE8AA9E]
    227: は [0xE381AF]
    229: 少 [0xE5B091]
    227: し [0xE38197]
    227: だ [0xE381A0]
    227: け [0xE38191]
    229: 分 [0xE58886]
    227: か [0xE3818B]
    227: り [0xE3828A]
    227: ま [0xE381BE]
    227: す [0xE38199]
    227: 。 [0xE38082]
    <!--snip-->

If I don't "use utf8" I can't utilize Unicode block matching. Well... actually that doesn't work anyway. In either mode, the result of

    my $letter = pack("U*",@spl);

will not match any of the three tested namespaces: Han, Hiragana, Katakana.
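One possible way to get both per-character splitting and script matching is sketched below. It assumes the input arrives as UTF-8 encoded bytes and that the Encode module is available (it ships with Perl from 5.8 on). script_of is a hypothetical helper name, not something from the post. The idea is to decode once, work on characters, and encode again only when printing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for two Japanese characters, a space and "a",
# written as escapes so this file's own encoding does not matter.
my $bytes = "\xE6\x97\xA5\xE3\x81\xAF a";

# Decode the bytes into Perl characters; from here on,
# one element of @chars is one character, not one byte.
my $text  = decode('UTF-8', $bytes);
my @chars = split //, $text;

# Hypothetical helper: name the script a single character belongs to.
sub script_of {
    my ($ch) = @_;
    return 'Han'      if $ch =~ /\p{Han}/;
    return 'Hiragana' if $ch =~ /\p{Hiragana}/;
    return 'Katakana' if $ch =~ /\p{Katakana}/;
    return 'other';
}

for my $ch (@chars) {
    # Encode on the way out so print does not warn about wide characters.
    printf "U+%04X %s %s\n", ord($ch), encode('UTF-8', $ch), script_of($ch);
}
```

The \p{...} properties match against decoded characters, so a string that still holds raw bytes would not be expected to match them.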
Re: Unicode Pack/Unpack Woes
by Courage (Parson) on Jan 11, 2003 at 11:14 UTC
by The Ninja K (Novice) on Jan 12, 2003 at 08:45 UTC

Re: Unicode Pack/Unpack Woes
by pg (Canon) on Jan 12, 2003 at 03:51 UTC
by The Ninja K (Novice) on Jan 12, 2003 at 08:51 UTC
by Anonymous Monk on Jan 13, 2003 at 18:20 UTC