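Force the string into byte mode, then check whether it fails to match a regex that accepts only valid UTF-8 sequences. I wrote one a while back but don't know if I could find it. Basically look at the spec: a single byte 0x00-0x7F is OK, or binary 110xxxxx followed by one continuation byte. Now 110xxxxx is just 11000000 through 11011111 inclusive, so you can write that as \xC0-\xDF. The continuation is 10xxxxxx, i.e. \x80-\xBF. Repeat for the 3- and 4-byte forms: 1110xxxx (\xE0-\xEF) followed by 2 continuation bytes, and 11110xxx (\xF0-\xF7) followed by 3 continuation bytes. Strictly speaking, the spec also rules out overlong encodings (so lead bytes \xC0 and \xC1 never appear), surrogates, and anything above U+10FFFF, so tighten the ranges if you need exact validation.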
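
For what it's worth, here is a rough sketch of that regex in Python (not the one I originally wrote). The names `UTF8_LOOSE` and `looks_like_utf8` are made up for illustration, and the pattern only follows the bit layouts described here, so it will still accept overlong forms such as `\xC0\x80`.

```python
import re

# Loose UTF-8 shape check built from the bit patterns:
# it does NOT reject overlong encodings, surrogates, or code points > U+10FFFF.
UTF8_LOOSE = re.compile(
    rb"(?:"
    rb"[\x00-\x7F]"                  # 0xxxxxxx: single byte (ASCII)
    rb"|[\xC0-\xDF][\x80-\xBF]"      # 110xxxxx + 1 continuation byte
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"   # 1110xxxx + 2 continuation bytes
    rb"|[\xF0-\xF7][\x80-\xBF]{3}"   # 11110xxx + 3 continuation bytes
    rb")*"
)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte in data fits one of the UTF-8 byte layouts."""
    return UTF8_LOOSE.fullmatch(data) is not None


if __name__ == "__main__":
    print(looks_like_utf8("café".encode("utf-8")))    # well-formed 2-byte sequence
    print(looks_like_utf8("café".encode("latin-1")))  # lone 0xE9 lead byte, no continuations
```

Working on `bytes` rather than `str` is the Python equivalent of forcing byte mode. If you need strict validation in Python, trying `data.decode("utf-8")` and catching `UnicodeDecodeError` is the simpler route.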