in reply to Intra-Unicode Conversions
The | binds more tightly than the comma, so for clarity the sprintf can be written like this:
sprintf("%c%c", (0xc0 | ($o >> 6)), (0x80 | ($o & 0x3f)))
Wow, talk about bit-fiddling. For the meanings of the specific operators, see perldoc perlop, but in short: we make the first byte by shifting the original character right six bits ($o >> 6), so the high bits of the original become the low bits of our first byte, and then OR-ing on the top two bits (0xc0). We make the second byte by masking off everything but the low six bits ($o & 0x3f), which are exactly the bits the first byte didn't cover, and then OR-ing on the top bit (0x80). I think.
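If it helps, here's the same computation unrolled into named steps with a sample character (a sketch of mine, not from the original node; the sample character is my choice):

    use strict;
    use warnings;

    my $o = 0xE9;                    # LATIN SMALL LETTER E WITH ACUTE, U+00E9

    my $byte0 = 0xc0 | ($o >> 6);    # shift the high bits down, turn on the top two bits
    my $byte1 = 0x80 | ($o & 0x3f);  # keep only the low six bits, turn on the top bit

    my $encoded = sprintf("%c%c", $byte0, $byte1);
    printf "%02x %02x\n", $byte0, $byte1;   # prints "c3 a9", the UTF-8 bytes for U+00E9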
I don't know enough about Unicode to explain the reasons behind the particulars, in terms of which bits end up where. If I were trying to figure it out, I'd draw myself a little diagram...
Something along these lines...
original value:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | | | | | | | | | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
new value:
        +-+-+-+-+-+-+-+-+         +-+-+-+-+-+-+-+-+
byte 0: | | | | | | | | | byte 1: | | | | | | | | |
        +-+-+-+-+-+-+-+-+         +-+-+-+-+-+-+-+-+
And then I'd mark which bits end up where: letters for the bits that get copied over, with ones and zeros in the new value to show which bits just get unilaterally turned on or off...
original value:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | |a|b|c|d|e|f|g|h|i|j|k|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 \___ ___/ \___ ___/ \____ ____/
     v         v          v
 discarded  byte 0      byte 1
new value:
        +-+-+-+-+-+-+-+-+         +-+-+-+-+-+-+-+-+
byte 0: |1|1|0|a|b|c|d|e| byte 1: |1|0|f|g|h|i|j|k|
        +-+-+-+-+-+-+-+-+         +-+-+-+-+-+-+-+-+
As near as I can figure, the top five bits of the original value just get thrown away, which means this two-byte trick only covers characters up through 0x7ff; anything above that needs the longer three-byte form. The good news is, the result will never get mistaken for ASCII, because the top bit of each byte is set and ASCII bytes never set it. One supposes it was designed that way deliberately. (That's also why the top bits of the original had to go: to make room for the fixed prefix bits. And you can tell the two bytes of a character apart by looking at the second bit, 1 in the first byte and 0 in the second, so you can always find the start of a character even when you land in the middle of a string. That too is probably a deliberate part of the design.)
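A quick way to convince yourself of the layout is to push a known character through it and print the bits. A sketch (the round trip at the end is just the same operations inverted; it's mine, not from the original node):

    use strict;
    use warnings;

    my $o = 0x3a9;   # GREEK CAPITAL LETTER OMEGA, U+03A9 = binary 011 1010 1001

    my $byte0 = 0xc0 | ($o >> 6);
    my $byte1 = 0x80 | ($o & 0x3f);
    printf "byte 0: %08b\n", $byte0;   # 11001110 -- prefix 110, then bits a-e
    printf "byte 1: %08b\n", $byte1;   # 10101001 -- prefix 10, then bits f-k

    # To invert it, mask off the prefixes and shift the pieces back together.
    my $back = (($byte0 & 0x1f) << 6) | ($byte1 & 0x3f);
    printf "round trip: U+%04X\n", $back;   # U+03A9 again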