Unicode Pack/Unpack Woes

The Ninja K has asked for the wisdom of the Perl Monks concerning the following question:

Good morning/afternoon all. First, the essentials. This is perl, v5.8.0 built for i386-linux-thread-multi, on both a Red Hat 7.1 and a Red Hat 8.0 box.
Please excuse the non-English characters.
The basics:

our $input = "&#26085;&#26412;&#35486;&#12399;&#23569;&#12375;&#12384;&#12369;&#20998;&#12363;&#12426;&#12414;&#12377;&#12290; and sam I am!";
# Japanese characters; meaning: there is a little bit of understanding of the Japanese language [at least].

Standard operating procedure for tearing apart a Unicode string is @bytes = unpack("U*",$string); but what it doesn't do (at least for me) is tear the string apart by character; it tears it apart by byte.
So I end up with code like this.

my @bytes = unpack("U*",$input);
my $i=0;
while(scalar(@bytes)>0)
{
    my $byt=1;    
    $byt=2 if ($bytes[$i] >= 192);
    $byt=3 if ($bytes[$i] >= 224);
    $byt=4 if ($bytes[$i] >= 240);
    $byt=5 if ($bytes[$i] >= 248);
    print "$bytes[$i]: ";
    my @spl = splice(@bytes,0,$byt);
    my $letter = pack("U*",@spl);
    print $letter." [0x";
    foreach (@spl){printf "%2.2X",$_;}
    print "] ";
    print "\n";
}

Which will tear the array apart and output the characters, not the bytes that comprise a word. (P.S. I haven't slept in many moons... if someone can drop me a note on making those ifs better before I wake up tomorrow, that'd be grand :))
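One possible way to tidy the chain of ifs, offered only as a sketch: put the lead-byte ranges into a small helper. The name utf8_seq_len is hypothetical, and the sketch assumes the input is a plain byte string holding UTF-8, which is what the output without "use utf8" suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes in the UTF-8 sequence that starts
# with the given lead byte (0 for a continuation or invalid lead byte).
sub utf8_seq_len {
    my ($lead) = @_;
    return 1 if $lead < 0x80;   # 0xxxxxxx: ASCII
    return 0 if $lead < 0xC0;   # 10xxxxxx: continuation byte, not a lead
    return 2 if $lead < 0xE0;   # 110xxxxx
    return 3 if $lead < 0xF0;   # 1110xxxx
    return 4 if $lead < 0xF8;   # 11110xxx
    return 0;                   # 0xF8 and above never start valid UTF-8
}

# Two three-byte characters as raw UTF-8 bytes, followed by an ASCII letter.
my @bytes = unpack("C*", "\xE6\x97\xA5\xE6\x9C\xAC" . "a");
my @groups;
while (@bytes) {
    my $len = utf8_seq_len($bytes[0]) || 1;   # step over stray bytes one at a time
    push @groups, join("", map { sprintf "%02X", $_ } splice(@bytes, 0, $len));
}
print join(" ", @groups), "\n";   # prints: E697A5 E69CAC 61
```

Unlike the original chain, this treats bytes from 0x80 to 0xBF as continuation bytes rather than as one-byte characters, and it has no five-byte case because current UTF-8 stops at four bytes.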
Now, the rub.

If I "use utf8", the output becomes

Wide character in print at text_kanji.pl line 24.
26085: &#26085;&#26412;&#35486;&#12399;&#23569; [0x65E5672C8A9E306F5C11] 

Wide character in print at text_kanji.pl line 24.
12375: &#12375;&#12384;&#12369;&#20998;&#12363; [0x305730603051520630 4B] 

Wide character in print at text_kanji.pl line 24.
12426: &#12426;&#12414;&#12377;&#12290;  [0x308A307E3059300220] 

97: a [0x61]
110: n [0x6E]
100: d [0x64]
32:   [0x20]
115: s [0x73]
97: a [0x61]
109: m [0x6D]
<!--snip-->
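The "Wide character in print" warning in the output above is what Perl emits when a character above 255 is printed to a filehandle that has no encoding layer. A minimal sketch of one way to silence it, assuming the terminal expects UTF-8:

```perl
use strict;
use warnings;

my $char = "\x{65E5}";   # U+65E5, the first character of $input

# Declare that STDOUT expects UTF-8; Perl then encodes wide characters
# on output instead of warning about them.
binmode(STDOUT, ":utf8");
printf "%d: %s [0x%X]\n", ord($char), $char, ord($char);
```

The first field printed is 26085, the same value that leads the first line of the "use utf8" output, which suggests that in that mode unpack("U*") was already returning code points rather than bytes.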

But if I don't "use utf8;" (not "no utf8", just not mentioned at all), I get the correct output...

230: &#26085; [0xE697A5]
230: &#26412; [0xE69CAC]
232: &#35486; [0xE8AA9E]
227: &#12399; [0xE381AF]
229: &#23569; [0xE5B091]
227: &#12375; [0xE38197]
227: &#12384; [0xE381A0]
227: &#12369; [0xE38191]
229: &#20998; [0xE58886]
227: &#12363; [0xE3818B]
227: &#12426; [0xE3828A]
227: &#12414; [0xE381BE]
227: &#12377; [0xE38199]
227: &#12290; [0xE38082]
<!--snip-->

If I don't "use utf8", I can't use Unicode block matching. Well... actually that doesn't work anyway, in either mode:

    my $letter = pack("U*",@spl);

will not match any of the three Unicode properties I tested: Han, Hiragana, Katakana.
But if utf8 is enabled, the original $input string will match Han, Hiragana and Katakana (yet the string created from pack will not).
I'm boggled. But I also haven't slept in 24 hours.
So if anyone has any suggestions, I'd love to hear them.
Cheers,
Nick
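For the property-matching part of the question, here is a hedged sketch of the approach that usually works: decode the bytes into characters first, then test each character against \p{...}. It assumes the data is UTF-8; the three sample characters are chosen here for illustration and are not all taken from $input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for U+65E5 (a kanji), U+306F (hiragana), U+30AB (katakana).
my $octets = "\xE6\x97\xA5\xE3\x81\xAF\xE3\x82\xAB";
my $chars  = decode("UTF-8", $octets);   # now three characters, not nine bytes

for my $c (split //, $chars) {
    my $kind = $c =~ /\p{Han}/      ? "Han"
             : $c =~ /\p{Hiragana}/ ? "Hiragana"
             : $c =~ /\p{Katakana}/ ? "Katakana"
             :                        "other";
    printf "U+%04X %s\n", ord($c), $kind;
}
```

Matching the undecoded $octets against the same properties would fail, because each byte on its own is just a Latin-1 range value as far as the regex engine is concerned.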


Replies are listed 'Best First'.
Re: Unicode Pack/Unpack Woes by Courage (Parson) on Jan 11, 2003 at 11:14 UTC
If you do not "use utf8;", then it is predictable that your line

my @bytes = unpack("U*",$input);

ends up splitting into bytes rather than characters. I advise you to write "use utf8;" at the very start of the program, and to use local scopes with "no utf8;" where needed. Also, you really do not need to reimplement the UTF-8 <-> code point transformation yourself. It is done more simply:

use utf8;
use Encode;
my $utf8str = decode("ucs-2",$input); # or "euc-jp", depends on $input

See "perldoc Encode" to see what I mean; it contains a lot of answers to your questions!

Courage, the Cowardly Dog
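As a sketch of the Encode route suggested above, assuming the input bytes are UTF-8 (the hex dumps in the question, such as E697A5, point that way rather than to UCS-2 or EUC-JP): once decoded, unpack("U*") returns one code point per character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The first three characters of the question's $input, as UTF-8 bytes.
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
my $chars  = decode("UTF-8", $octets);

my @code_points = unpack("U*", $chars);   # one entry per character
for my $cp (@code_points) {
    # Re-encode the single character to show the bytes it came from.
    my $hex = uc unpack("H*", encode("UTF-8", chr $cp));
    print "$cp: [0x$hex]\n";
}
```

The hex groups produced this way are the same three-byte groups the hand-written loop in the question printed, without any lead-byte arithmetic.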
Re: Re: Unicode Pack/Unpack Woes by The Ninja K (Novice) on Jan 12, 2003 at 08:45 UTC
Thanks for the insight into things, Courage. While I still think something is amiss (and so I'll post again), use Encode did solve my problems. However, here's what I don't get:

$s2u = Text::Iconv->new("sjis", "utf-8");
$x = $s2u->convert($str);
$x = decode("utf-8",$x); # Causes Perl to correctly treat the string as unicode because utf-8 flag is on [oi].

I should not have to decode something that should already be in UTF-8 format, no? Anyway, thanks to that, this works and I can move on. But something still doesn't seem right...
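A likely explanation, sketched under assumptions: a converter like iconv hands back a string of UTF-8 encoded bytes, and Perl has no way to know those bytes are meant as characters until decode says so. The sketch below uses only Encode, with an encode/decode pair standing in for the Text::Iconv call so that it runs without that module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $sjis = "\x93\xFA";   # U+65E5 encoded as Shift_JIS

# Stand-in for what a byte-oriented converter returns: UTF-8 *bytes*.
my $utf8_octets = encode("UTF-8", decode("shiftjis", $sjis));
printf "octets: length %d, character string: %s\n",
    length($utf8_octets), utf8::is_utf8($utf8_octets) ? "yes" : "no";

# decode() turns those bytes into one character that Perl knows about.
my $chars = decode("UTF-8", $utf8_octets);
printf "chars:  length %d, character string: %s\n",
    length($chars), utf8::is_utf8($chars) ? "yes" : "no";
```

If Encode is available anyway, decode("shiftjis", $str) on its own goes from Shift_JIS bytes to a character string in one step, which would make the separate conversion unnecessary.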
Re: Unicode Pack/Unpack Woes by pg (Canon) on Jan 12, 2003 at 03:51 UTC
A piece of code to demo:

the right way to use pack and unpack with Unicode
how to form a Unicode string with \x{}

(I am wondering whether you are dealing with some XML stuff, as you used that &#; syntax. Just in case: you are not expecting Perl itself to understand the &#; syntax, are you?)

use strict;

sub display {
    my $string = shift;
    use utf8; # as you can see from the result, whether to use utf8 or bytes is irrelevant in this demo, as "U" forces unicode anyway
    print "\nchar semantics: ";
    print "$string ";
    printf "Length = %d, ", length($string);
    printf "Content = %vd\n", $string;
    use bytes;
    print "byte semantics: ";
    print "$string ";
    printf "Length = %d, ", length($string);
    printf "Content = %vd\n", $string;
}

my $encoded_string;
my @decoded_list;

{
    use bytes;
    print "=========================\n";
    print "Case 1: create string from pack, with use bytes\n";
    $encoded_string = pack("U*", 400, 306);
    display $encoded_string;
    @decoded_list = unpack("U*", $encoded_string);
    print join(".", @decoded_list), "\n";
}

{
    use utf8; # not necessary in this case
    print "=========================\n";
    print "Case 2: create string from pack, with use utf8\n";
    $encoded_string = pack("U*", 400, 306);
    display $encoded_string;
    @decoded_list = unpack("U*", $encoded_string);
    print join(".", @decoded_list), "\n";
}

{
    print "=========================\n";
    print "Case 3: create string from \\x{}\n";
    $encoded_string = "\x{190}\x{132}"; # hex values of 400 and 306
    display $encoded_string;
    @decoded_list = unpack("U*", $encoded_string);
    print join(".", @decoded_list), "\n";
}
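A condensed sketch of the same round trip, without the "use bytes" pragma: it checks that pack("U*") and the \x{} notation build the same two-character string, and uses Encode to show the byte length separately. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $packed  = pack("U*", 400, 306);   # code points to characters
my $literal = "\x{190}\x{132}";       # the same two characters in hex

my @code_points = unpack("U*", $packed);   # characters back to code points
my $byte_length = length(encode("UTF-8", $packed));

print "same string\n" if $packed eq $literal;
printf "%d characters, %d bytes as UTF-8, code points %s\n",
    length($packed), $byte_length, join(".", @code_points);
```

Both characters fall in the two-byte range of UTF-8, so the character length is 2 while the encoded length is 4.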
Re: Re: (Decimal Char Values) Unicode Pack/Unpack Woes by The Ninja K (Novice) on Jan 12, 2003 at 08:51 UTC
Quick response to part of that response: thanks for the demo code. The &#####; comes from perlmonks.org substituting an HTML representation for the Japanese text I had thrown in as an example :)
Re: Re: Re: (Decimal Char Values) Unicode Pack/Unpack Woes by Anonymous Monk on Jan 13, 2003 at 18:20 UTC
What I want to know is how perlmonks did exactly that (converted UTF-8 to &#nnn; format; in case they played with that, it's amp, pound, digits, semi), since that's exactly what I need to do! Sad thing is, I know I did it once! And if you can't help me directly, perhaps you can remember how I did it last time? ;) Thanks.
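How perlmonks.org itself does the conversion is not stated anywhere in this thread, so the following is only a sketch of one way to get the same result. It assumes the input is UTF-8 bytes, and the function name to_numeric_entities is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: UTF-8 bytes in, ASCII text with &#nnn; references out.
sub to_numeric_entities {
    my ($octets) = @_;
    my $chars = decode("UTF-8", $octets);
    # Replace every non-ASCII character with its decimal code point.
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

print to_numeric_entities("\xE6\x97\xA5\xE6\x9C\xAC and sam I am!"), "\n";
# prints: &#26085;&#26412; and sam I am!
```

The two references produced are the same ones that open $input in the original question, so the round trip from the page's display back to entities is at least consistent.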