in reply to regex for utf-8
The pattern ([\xC2\xC3])([\x80-\xBF]) matches exactly two characters. The first is code C2 or C3 (binary 11000010 or 11000011).
The second is anything in the range 10000000 through 10111111, or 10xxxxxx, which is a continuation byte.
The "|" is not alternation, because it appears in the replacement code rather than in the pattern; there it is a bitwise OR. The /e modifier means the right-hand side is evaluated as an expression, and its result is used as the replacement.
What remains is to take the last two bits of the first byte (10 or 11) and the low six bits of the second byte (the x's) and assemble the final byte from them. That's what the shifting and masking are for.
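To make the bit arithmetic concrete, here is a minimal sketch of the same idea. It is written in Python rather than the language of the original substitution, purely to show the shifting and masking; the names PAIR, fold_pair and utf8_pairs_to_single_bytes are made up for this illustration and are not from the original code.

```python
import re

# Same two groups as the pattern discussed above:
# a lead byte C2 or C3, then one continuation byte 80..BF.
PAIR = re.compile(rb"([\xC2\xC3])([\x80-\xBF])")

def fold_pair(match):
    """Build one byte from a matched two-byte sequence (hypothetical helper)."""
    first = match.group(1)[0]    # lead byte as an integer, 0xC2 or 0xC3
    second = match.group(2)[0]   # continuation byte as an integer, 10xxxxxx
    high = (first << 6) & 0xC0   # last two bits of the lead byte, moved to the top
    low = second & 0x3F          # the six x bits of the continuation byte
    return bytes([high | low])   # "|" here is bitwise OR, not alternation

def utf8_pairs_to_single_bytes(data: bytes) -> bytes:
    """Replace every matching two-byte sequence with its single-byte value."""
    return PAIR.sub(fold_pair, data)
```

For example, the bytes C3 A9 become the single byte E9: C3 contributes the bits 11, A9 contributes 101001, and together they give 11101001.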
— John