in reply to regex and multibyte strings
Jeffrey Friedl's book Mastering Regular Expressions goes into detail on handling Unicode and multibyte characters in regular expressions. You may wish to start there.