
What the rather convoluted regex in your post does is take a character (in the narrow definition thereof: unsigned char in C speak) in the range 0x80 - 0xFF (your basic 'code page playground' of yore) and convert it to a two-byte UTF-8 sequence. It does this in a very naive way, BTW: the result is only correct if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them (in practice that means ISO 8859-1, a.k.a. Latin-1).

UTF-8 says that any Unicode codepoint in the range U+0080 to U+07FF is encoded in two bytes, with the first three bits (highest order bits) of the first (highest order) byte being 110 and the first two bits of the second byte being 10. The remaining 11 bits are used to store the actual codepoint value. E.g., the character U+00A4 (the currency symbol ¤) is stored as follows:

Codepoint U+00A4 --> hex 0xA4 --> binary 10100100

We need to store 10100100 in the UTF-8 bytes:

110..... 10......

We pad 10100100 to 11 bits (00010100100) and distribute those over the free bit positions in the two bytes, 5 bits in the first byte and 6 in the second:

110 00010 10 100100

So U+00A4 in UTF-8 becomes 11000010 10100100 or 0xc2 0xa4.
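To check the worked example, here is a minimal sketch of the same bit distribution in Perl. The helper name two_byte_utf8 is made up for illustration and is not from the thread; it only handles the two-byte range U+0080 to U+07FF described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): encode one code point in the
# range U+0080..U+07FF as the two byte values of its UTF-8 form.
sub two_byte_utf8 {
    my ($cp) = @_;
    die "code point outside the two-byte range\n"
        if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0xC0 | ($cp >> 6);      # 110xxxxx: upper 5 of the 11 payload bits
    my $second = 0x80 | ($cp & 0x3F);    # 10xxxxxx: lower 6 payload bits
    return ($first, $second);
}

my @bytes = two_byte_utf8(0xA4);
printf "%08b %08b\n", @bytes;        # prints 11000010 10100100
printf "0x%02x 0x%02x\n", @bytes;    # prints 0xc2 0xa4
```

The range check is there because code points above U+07FF need three or four bytes, which this sketch deliberately does not attempt.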

Note that if the original text was in ISO 8859-15, 0xA4 is the euro symbol € which would be translated to ¤ by the regex.
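To see that difference concretely, the following sketch decodes the same byte under both code pages with the core Encode module and prints the resulting UTF-8 bytes. The helper name utf8_hex_for is hypothetical; using Encode here is my suggestion for checking the claim, not something the regex under discussion does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret $octets in the given charset, re-encode as
# UTF-8, and return the resulting bytes as a hex string.
sub utf8_hex_for {
    my ($charset, $octets) = @_;
    my $chars = decode($charset, $octets);   # bytes -> Unicode characters
    my $utf8  = encode('UTF-8', $chars);     # characters -> UTF-8 bytes
    return join ' ', map { sprintf('0x%02x', ord($_)) } split //, $utf8;
}

for my $charset ('iso-8859-1', 'iso-8859-15') {
    printf "%-12s 0xa4 -> %s\n", $charset, utf8_hex_for($charset, "\xA4");
}
# iso-8859-1   0xa4 -> 0xc2 0xa4          (U+00A4, currency sign)
# iso-8859-15  0xa4 -> 0xe2 0x82 0xac     (U+20AC, euro sign)
```

The euro sign lives at U+20AC, outside the two-byte range, so it needs three UTF-8 bytes and cannot be produced by the two-byte bit twiddling at all.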

Anyway, the bit twiddling in the sprintf does the UTF-8 conversion, where $o is the ordinal value of the matched character (I'm using jonadab's representation here):

sprintf("%c%c", 

# Build first byte by OR'ing 0xc0 (binary 11000000) with 
# the two highest order bits of the character
          (0xc0 | ($o >> 6)),

# Build the second byte by OR'ing 0x80 (binary 10000000)
# with the lower 6 bits of the character (obtained by
# AND'ing with 0x3f, binary 00111111)
          (0x80 | ($o & 0x3f)))
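Put together, the whole conversion could look something like the sketch below. This is a reconstruction under assumptions, not the regex from the original post: the sub name latin1_bytes_to_utf8 is made up, and the code assumes the input is a byte string in ISO 8859-1.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not the original regex): treat every byte in
# 0x80..0xFF as the Latin-1 code point of the same value and replace it
# with its two-byte UTF-8 sequence. ASCII bytes pass through untouched.
sub latin1_bytes_to_utf8 {
    my ($text) = @_;
    $text =~ s{([\x80-\xFF])}{
        my $o = ord $1;
        sprintf "%c%c", (0xc0 | ($o >> 6)), (0x80 | ($o & 0x3f))
    }ge;
    return $text;
}

my $out = latin1_bytes_to_utf8("caf\xE9");    # "café" in Latin-1
print join(' ', map { sprintf('%02x', ord($_)) } split //, $out), "\n";
# prints 63 61 66 c3 a9
```

Running already-converted text through this a second time would double-encode it, since the UTF-8 bytes themselves fall in 0x80 - 0xFF.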

Please excuse my gratuitous invention of new English verbs.

CU
Robartes-

In reply to Re: Intra-Unicode Conversions by robartes
in thread Intra-Unicode Conversions by kettle
