The | binds more tightly than the comma, so the sprintf can be written, for clarity, like this:

sprintf("%c%c", (0xc0 | ($o >> 6)), (0x80 | ($o & 0x3f)) )

Wow, talk about bit-fiddling. For the meanings of the specific operators, see perldoc perlop, but in short we're making the first byte by shifting the original character's bits six places to the right and then turning on the top two bits (that's the 0xc0), so the high bits of the original character become the low bits of our first byte. Then we're making the second byte by masking the original value down to its low six bits (the & 0x3f turns off the high bits, which are already represented in the first byte) and turning on the top bit (the 0x80).
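If you want to check that reading, one way is to push a single character through the two expressions and look at the bits. This is only a sketch: the sub name two_byte_utf8 is made up for the illustration, and the range check is my own addition, on the assumption that the two-byte form is only meant for values from 0x80 through 0x7ff.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split one code
# point in the range 0x80..0x7ff into its two encoded byte values.
sub two_byte_utf8 {
    my ($o) = @_;
    die "two-byte form only covers 0x80..0x7ff\n"
        if $o < 0x80 || $o > 0x7ff;
    my $byte0 = 0xc0 | ($o >> 6);      # top two bits on, then the high bits of $o
    my $byte1 = 0x80 | ($o & 0x3f);    # top bit on, then the low six bits of $o
    return ($byte0, $byte1);
}

my ($b0, $b1) = two_byte_utf8(0xe9);   # 0xe9 is e with an acute accent
printf "%08b %08b\n", $b0, $b1;        # prints "11000011 10101001"
printf "%02x %02x\n", $b0, $b1;        # prints "c3 a9"
```

Feeding the two values to sprintf("%c%c", ...) as in the original line gives the same pair of bytes as a string.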

I don't know enough about Unicode to explain the reasons behind the particulars, in terms of which bits end up where. If I were trying to figure it out, I'd draw myself a little diagram...

Something along these lines...
  original value:
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             | | | | | | | | | | | | | | | | |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  new value:
            +-+-+-+-+-+-+-+-+          +-+-+-+-+-+-+-+-+
    byte 0: | | | | | | | | |  byte 1: | | | | | | | | |
            +-+-+-+-+-+-+-+-+          +-+-+-+-+-+-+-+-+

And then I'd draw in arrows showing which bits end up where, with ones and zeros in the new value to show which bits just get unilaterally turned on or off...

  original value:
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             | | | | | | | | | | | | | | | | |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                        | | | | | | | | | | |
                       / / / / /   \ \ \ \ \ \
                      / / / / /     \ \ \ \ \ \
                     / / / / /       \ \ \ \ \ \
                    / / / / /         \ \ \ \ \ \
                   | | | | |           \ \ \ \ \ \
                   | | | | |            \ \ \ \ \ \
                   | | | | |             \ \ \ \ \ \
                   | | | | |              \ \ \ \ \ \
                   | | | | |               \ \ \ \ \ \
                   | | | | |                | | | | | |
  new value:       v v v v v                v v v v v v
            +-+-+-+-+-+-+-+-+          +-+-+-+-+-+-+-+-+
    byte 0: |1|1|0| | | | | |  byte 1: |1|0| | | | | | |
            +-+-+-+-+-+-+-+-+          +-+-+-+-+-+-+-+-+

As near as I can figure, nothing actually gets thrown away. In the two-byte form of UTF-8 the first byte looks like 110xxxxx, so it only has room for five bits of the original value, and the second byte looks like 10xxxxxx, which holds six more, for eleven in all. So this only works when the original value is below 0x800; the expression above doesn't mask off anything higher, and larger values need the three- and four-byte forms instead. The good news is, the result will never get mistaken for ASCII, because the top bit of each byte is set, and UTF-8 was designed that way deliberately. Also you can tell between the two bytes of the character which is which by looking at the second bit (1 for the first byte, 0 for the second), and that too is a deliberate part of the design.
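The "you can tell which byte is which" part is easy to sketch in code as well. The sub name byte_kind and its three labels are invented for this example; the only thing it relies on is the top-bit patterns described above.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one byte value by its top bits.
sub byte_kind {
    my ($byte) = @_;
    return 'ascii'        if ($byte & 0x80) == 0x00;   # 0xxxxxxx
    return 'continuation' if ($byte & 0xc0) == 0x80;   # 10xxxxxx
    return 'lead';                                     # 11xxxxxx
}

# 0x41 is "A"; 0xc3 and 0xa9 are the two bytes that encode 0xe9.
print byte_kind($_), "\n" for 0x41, 0xc3, 0xa9;
# prints "ascii", "lead", "continuation", one per line
```

Because the three patterns never overlap, a program that lands in the middle of a string can always tell whether it is looking at plain ASCII, the start of a multi-byte character, or the inside of one.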


Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. You can just call me "Mister Sanity". Why, I've got so much sanity it's driving me crazy.

In reply to Re: Intra-Unicode Conversions by jonadab
in thread Intra-Unicode Conversions by kettle
