Re^3: Matching Ā & € type characters with a regex

The "fffd" characters are the Unicode "replacement character" (U+FFFD). That's what you get when something converts non-Unicode data into Unicode on the assumption that the data is in some particular character set, and the conversion comes across a byte value or byte sequence that shouldn't exist in that character set (and so cannot be mapped to a meaningful Unicode character).
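Here is a minimal sketch of how that can happen, assuming the conversion is done with Perl's Encode module (I don't know what actually produced your data). The sample bytes are made up: 0x92 is the right single quote in cp1252, but on its own it is not a valid utf8 sequence.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: cp1252 bytes, where 0x92 is a right single quote.
my $bytes = "Course\x92s Syllabus";

# Wrong guess: treat the bytes as UTF-8. With the default CHECK value,
# decode() does not die on the malformed byte; it substitutes U+FFFD.
my $as_utf8 = decode('UTF-8', $bytes);

# Right guess: decode with the character set the data was really in,
# which maps 0x92 to U+2019 (right single quotation mark).
my $as_cp1252 = decode('cp1252', $bytes);

# Show the code point that ended up right after "Course" in each case.
printf "as UTF-8:  U+%04X\n", ord substr($as_utf8,   6, 1);
printf "as cp1252: U+%04X\n", ord substr($as_cp1252, 6, 1);
```

Once the replacement character is in the string, the original byte is gone, so the fix has to happen at the step that decodes the data.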

In this particular case, the data going into tlu was utf8 (based on the correct rendering of the "right-single-quote" and the symbol following "Windows"), but the characters after "Course" and "Syllabus" were already messed up before going into tlu.

For things that aren't messed up, you either s/widechar/asciichar/g; (e.g. s/\x{2019}/'/g) or you tr/widechar//d (i.e. get rid of them). For fffd, it's probably best to just get rid of it, but maybe also figure out what put it there in the first place.
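A rough sketch of both approaches, assuming the string has already been decoded into Perl characters. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: two replacement characters and one right single quote.
my $text = "Course\x{FFFD} Syllabus\x{FFFD}: the student\x{2019}s guide";

# Substitute an ASCII stand-in for a wide character that is meaningful.
(my $clean = $text) =~ s/\x{2019}/'/g;

# Delete the replacement characters; tr/// returns how many it matched.
my $removed = ($clean =~ tr/\x{FFFD}//d);

print "$clean\n";
print "removed $removed replacement character(s)\n";
```

The count from tr/// is handy as a quick check on how much of the input was already damaged before you cleaned it.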
