Good morning/afternoon all. First, the essentials. This is perl, v5.8.0 built for i386-linux-thread-multi, on both a Red Hat 7.1 and a Red Hat 8.0 box.
Please excuse the non-English characters.
The basics:
our $input = "日本語は少しだけ分かります。 and sam I am!"; # Japanese text; it means roughly "I understand a little Japanese" [at least].
Standard operating procedure for tearing apart a Unicode string is @bytes = unpack("U*",$string); but what it doesn't do (at least for me) is tear the string apart by character. It tears it apart by bytes.
So I end up with code like this.
my @bytes = unpack("U*", $input);
my $i = 0;
while (scalar(@bytes) > 0) {
    my $byt = 1;
    $byt = 2 if ($bytes[$i] >= 192);
    $byt = 3 if ($bytes[$i] >= 224);
    $byt = 4 if ($bytes[$i] >= 240);
    $byt = 5 if ($bytes[$i] >= 248);
    print "$bytes[$i]: ";
    my @spl = splice(@bytes, 0, $byt);
    my $letter = pack("U*", @spl);
    print $letter . " [0x";
    foreach (@spl) { printf "%2.2X", $_; }
    print "] ";
    print "\n";
}
This tears the array apart and outputs the characters, not the individual bytes that make them up. (P.S. I haven't slept in many moons... if someone can drop me a note on making those ifs better before I wake up tomorrow, that'd be grand :) )
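On tidying up those ifs: one possible shape, offered only as a rough sketch, is a small helper that maps a lead byte to its sequence length. The names utf8_seq_len and group_utf8_bytes are made up for illustration. The sketch assumes the array holds raw UTF-8 bytes (fetched here with unpack("C*", ...)) and that sequences are at most four bytes long, as in standard UTF-8, so it drops the five-byte case.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): given the first byte of a
# UTF-8 sequence, return how many bytes the sequence occupies.
# Assumes well-formed standard UTF-8, where a sequence is 1 to 4 bytes long.
sub utf8_seq_len {
    my ($lead) = @_;
    return $lead >= 0xF0 ? 4    # 11110xxx
         : $lead >= 0xE0 ? 3    # 1110xxxx
         : $lead >= 0xC0 ? 2    # 110xxxxx
         :                 1;   # ASCII (or a stray continuation byte)
}

# Hypothetical helper: group raw bytes into one array ref per character.
sub group_utf8_bytes {
    my @bytes = @_;
    my @groups;
    while (@bytes) {
        push @groups, [ splice(@bytes, 0, utf8_seq_len($bytes[0])) ];
    }
    return @groups;
}

# "\xE6\x97\xA5" is the UTF-8 encoding of U+65E5; "a" is a single byte.
my @groups = group_utf8_bytes(unpack("C*", "\xE6\x97\xA5a"));

for my $group (@groups) {
    my $hex = join '', map { sprintf('%02X', $_) } @$group;
    print "$group->[0]: [0x$hex]\n";
}
```

The chained ternary tests the largest threshold first, so each lead byte gets exactly one answer instead of being overwritten by successive ifs.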
Now, the rub.
If I "use utf8", the output becomes:
Wide character in print at text_kanji.pl line 24.
26085: 日本語は少 [0x65E5672C8A9E306F5C11]
Wide character in print at text_kanji.pl line 24.
12375: しだけ分か [0x3057306030515206304B]
Wide character in print at text_kanji.pl line 24.
12426: ります。  [0x308A307E3059300220]
97: a [0x61]
110: n [0x6E]
100: d [0x64]
32:   [0x20]
115: s [0x73]
97: a [0x61]
109: m [0x6D]
<!--snip-->
But if I don't "use utf8" (not "no utf8", just not mentioned at all), I get the correct output:
230: 日 [0xE697A5]
230: 本 [0xE69CAC]
232: 語 [0xE8AA9E]
227: は [0xE381AF]
229: 少 [0xE5B091]
227: し [0xE38197]
227: だ [0xE381A0]
227: け [0xE38191]
229: 分 [0xE58886]
227: か [0xE3818B]
227: り [0xE3828A]
227: ま [0xE381BE]
227: す [0xE38199]
227: 。 [0xE38082]
<!--snip-->
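One reading of the two outputs, offered as a sketch rather than a diagnosis of 5.8.0: with "use utf8" the literal already holds characters, so unpack("U*") hands back one code point per character (26085 for 日), and the lead-byte test then fires on every code point above 248 and splices five characters at a time. Without it, the literal holds UTF-8 bytes and the loop behaves as intended. A way to sidestep the byte counting altogether might be to decode once and split by character. This assumes the core Encode module is available; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for U+65E5 U+672C followed by ASCII "a", written as escapes so
# the example does not depend on how this file itself is encoded.
my $octets = "\xE6\x97\xA5\xE6\x9C\xACa";

# Decode once: from here on the string holds characters, not bytes.
my $chars = decode('UTF-8', $octets);

my @report;
for my $ch (split //, $chars) {
    # Re-encode a single character only to display its UTF-8 bytes.
    my $hex = join '', map { sprintf('%02X', $_) }
              unpack('C*', encode('UTF-8', $ch));
    push @report, sprintf('%d: %s [0x%s]', ord($ch), $ch, $hex);
}

# An encoding layer on STDOUT avoids "Wide character in print" warnings.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @report;
```

Here the number before the colon is the character's code point rather than its lead byte, and the hex is the UTF-8 encoding of that one character.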
If I don't "use utf8", I can't use Unicode block matching. Well... actually, that doesn't work anyway. In either mode, the string built by
my $letter = pack("U*",@spl);
will not match any of the three I tested: Han, Hiragana, Katakana.
But with utf8 enabled, the original $input string does match Han, Hiragana, and Katakana (yet the string created by pack still does not).
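A possible explanation for the pack'ed string not matching without "use utf8", again hedged: in that mode @spl holds three byte values, and pack("U*", 230, 151, 165) builds a three-character string (U+00E6, U+0097, U+00A5) that merely prints as the same bytes as 日. The property classes look at characters, so none of those three counts as Han. The sketch below assumes a Perl where \p{Han}, \p{Hiragana}, and \p{Katakana} are available, and uses chr() to build real characters. It does not explain the mismatch seen with utf8 enabled.

```perl
use strict;
use warnings;

my $kanji    = chr(0x65E5);   # U+65E5, one real character
my $hiragana = chr(0x306F);   # U+306F
my $katakana = chr(0x30AB);   # U+30AB

# What the byte-splitting loop builds without "use utf8": three characters
# whose code points happen to equal the UTF-8 bytes of U+65E5.
my $from_bytes = pack("U*", 230, 151, 165);

my %result = (
    kanji_is_han         => ($kanji      =~ /\p{Han}/      ? 1 : 0),
    hiragana_is_hiragana => ($hiragana   =~ /\p{Hiragana}/ ? 1 : 0),
    katakana_is_katakana => ($katakana   =~ /\p{Katakana}/ ? 1 : 0),
    from_bytes_is_han    => ($from_bytes =~ /\p{Han}/      ? 1 : 0),
    from_bytes_length    => length($from_bytes),
);

print "$_: $result{$_}\n" for sort keys %result;
```

If that reading is right, packing the code point itself, as in pack("U", 0x65E5), would give a single character that the Han test can see.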
I'm boggled. But I also haven't slept in 24 hours.
So if anyone has any suggestions, I'd love to hear them.
Cheers,
Nick

In reply to Unicode Pack/Unpack Woes by The Ninja K
