in reply to Re: Re: Re: regex for utf-8
in thread regex for utf-8

I am wrapping my head around the bit-masking and conditional bit-shifting I need to do to extract the actual value of the code. The czyborra site http://czyborra.com/utf/ is invaluable, but my head is thick. How do I march down from the high bit of the first byte, testing and then extracting the hex codes from the succeeding bits?
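
For reference, the bit-masking described above can be sketched by hand. This is only an illustration, not code from the thread: `decode_utf8_char` is a hypothetical helper name, and it does not reject overlong encodings or surrogates. The lead byte is tested from its high bit downward to find the sequence length, then each continuation byte contributes its low six bits.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one UTF-8 sequence from the front of a byte string.
# Returns (code point, bytes consumed), or an empty list on malformed input.
sub decode_utf8_char {
    my ($bytes) = @_;
    my @b = unpack "C*", $bytes;    # raw byte values, 0..255
    return unless @b;
    my $first = $b[0];

    my ($len, $cp);
    if    (($first & 0x80) == 0x00) { ($len, $cp) = (1, $first) }           # 0xxxxxxx
    elsif (($first & 0xE0) == 0xC0) { ($len, $cp) = (2, $first & 0x1F) }    # 110xxxxx
    elsif (($first & 0xF0) == 0xE0) { ($len, $cp) = (3, $first & 0x0F) }    # 1110xxxx
    elsif (($first & 0xF8) == 0xF0) { ($len, $cp) = (4, $first & 0x07) }    # 11110xxx
    else                            { return }    # stray continuation or invalid lead byte

    return if @b < $len;            # sequence is truncated
    for my $i (1 .. $len - 1) {
        return unless ($b[$i] & 0xC0) == 0x80;    # continuation bytes are 10xxxxxx
        $cp = ($cp << 6) | ($b[$i] & 0x3F);       # append the low six bits
    }
    return ($cp, $len);
}

my ($cp, $len) = decode_utf8_char("\xE2\x82\xAC");    # the three bytes of the euro sign
printf "U+%04X (%d bytes)\n", $cp, $len;              # U+20AC (3 bytes)
```

To walk a whole string, call the helper repeatedly and strip the consumed bytes with `substr` after each call.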

Re: Re: Re: Re: Re: regex for utf-8
by Thelonius (Priest) on Feb 28, 2003 at 16:12 UTC
    I am wrapping my head around the bit-masking and conditional bit-shifting I need to do to extract the actual value of the code. The czyborra site http://czyborra.com/utf/ is invaluable, but my head is thick. How do I march down from the high bit of the first byte, testing and then extracting the hex codes from the succeeding bits?
    That's what unpack "U*" does.
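
A minimal sketch of that suggestion follows. Here "U*" is the template: "U" means a Unicode character number and "*" means repeat for the rest of the string. One assumption is labeled loudly: the sketch decodes the raw bytes with the Encode module first and only then unpacks, because perl 5.10 changed pack and unpack to process strings character by character by default, so relying on how "U*" treats an undecoded byte string is not portable across versions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the euro sign followed by "A".
my $bytes = "\xE2\x82\xAC\x41";

# Turn the bytes into a Perl character string (assumption: input is valid UTF-8).
my $chars = decode('UTF-8', $bytes);

# "U*" yields one code point per character in the string.
my @code_points = unpack "U*", $chars;

printf "U+%04X\n", $_ for @code_points;    # U+20AC, then U+0041
```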
      I have RTFM'ed pack() and unpack(), but don't understand their use in this context. What is the "TEMPLATE" being used here?
      unpack TEMPLATE,EXPR
      unpack does the reverse of pack: it takes a string and expands it out into a list of values. (In scalar context, it returns merely the first value produced.)
      This line in the manual I find obscure as well, though it seems it would help me if I understood it:
      sub ordinal { unpack("c",$_[0]); } # same as ord()
      Could you explain unpack "U*"? What is the "U"? Something to do with Unicode? (I listened to "Well You Needn't" (angular piano music) last night, in celebration of the post showing the table for converting with cp 1252.)
        I see that "pack()" is a little more expressive than "unpack()" in the manual. Sorry for asking the question before reading all I could.
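
On the manual line quoted above: in `unpack("c",$_[0])` the TEMPLATE is the string "c", which asks for one signed char value. A small sketch, assuming single-byte input, shows where that agrees with ord() and where it does not. The unsigned "C" template is used for comparison.

```perl
use strict;
use warnings;

# The manual's example: "c" is the TEMPLATE, meaning one signed char.
sub ordinal { unpack("c", $_[0]); }

my $byte = "\xE9";    # a single byte with its high bit set (233)

printf "c:   %d\n", ordinal("A");          # 65, same as ord("A")
printf "c:   %d\n", ordinal($byte);        # -23, because "c" is signed
printf "C:   %d\n", unpack("C", $byte);    # 233, "C" is unsigned
printf "ord: %d\n", ord($byte);            # 233
```

So the "same as ord()" comment holds only for values below 128. For bytes with the high bit set, "C" is the template that matches ord().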