in reply to Re^2: Composite Charset Data to UTF8?
in thread Composite Charset Data to UTF8?
Have a look at the encoding rules of UTF-8.
A valid UTF-8 sequence starts with an octet of the form 0b0xxxxxxx or 0b11xxxxxx. Octets of the form 0b10xxxxxx are continuation octets, so a sequence that starts with one is invalid UTF-8:
> perl -wle "print sprintf '%08b', $_ for (0xa9,0xae)"
10101001
10101110
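To make the bit patterns a bit more concrete, here is a minimal sketch that sorts octets into the three groups described above. The name classify_octet is made up for this post, and it only implements the simplified rule given here (real UTF-8 also never uses 0xC0, 0xC1 or 0xF5..0xFF):

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): classify one octet
# by its leading bits, following the simplified rule above.
sub classify_octet {
    my ($octet) = @_;
    return 'ascii'        if ($octet & 0x80) == 0x00;   # 0xxxxxxx
    return 'continuation' if ($octet & 0xC0) == 0x80;   # 10xxxxxx
    return 'lead';                                      # 11xxxxxx
}

# Print hex value, bit pattern and class for a few sample octets.
printf "%02x %08b %s\n", $_, $_, classify_octet($_)
    for 0x41, 0xa9, 0xae, 0xc3;
```

Both 0xa9 and 0xae land in the continuation group, so neither can start a UTF-8 sequence.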
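An untested easy check could be to match your string against /[\x80-\xBF]/, a character class covering the hex values of the bit patterns we've identified. Note that valid multi-byte sequences also contain such octets after their lead octet, so a match only tells you the string is not plain ASCII; it is invalid UTF-8 only where no lead octet precedes the match: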
> perl -wle "print sprintf '%08b - %02x', $_,$_ for (0b10000000,0b10111111)"
10000000 - 80
10111111 - bf
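If the regex turns out to be too crude, a stricter alternative is to let the core Encode module try to decode the octets and see whether it complains. This is a different technique from the regex above, not a refinement of it. A minimal sketch, where is_valid_utf8 is a name made up for this post and the sample strings are invented:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the octet string decodes as strict UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying the caller's string.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Invented sample data: lead + continuation, a lone Latin-1 octet,
# and a string starting with a stray continuation octet.
for my $sample ("caf\xc3\xa9", "caf\xe9", "\xa9 2010") {
    printf "%-12s %s\n",
        join('', map { sprintf '%02x', ord } split //, $sample),
        is_valid_utf8($sample) ? 'valid' : 'invalid';
}
```

For mixed data you could run this per line or per field and only convert the pieces that fail the check.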