in reply to Re: Matching  & € type characters with a regex
in thread Matching  & € type characters with a regex

So, running those three lines through your script, I got exactly what you expected:
Course\x{fffd} Syllabus\x{fffd} Operator\x{2019}s Manual Windows\x{00ae} Version 5.2
Now, I can think of several ways to go from here. Converting to hex (in my update above) worked but wasn't really ideal, since the (R) mark would have been tossed.

If your goal was to convert those three lines into this:

Course Syllabus Operator's Manual Windows® Version 5.2
What would you do next?

Thanks for the help!

Re^3: Matching  & € type characters with a regex
by wfsp (Abbot) on Feb 13, 2009 at 10:20 UTC
    There's a discussion on normalising accents in The Björk Situation.

    I use thundergnat's approach but note the limitations pointed out by other monks.

    Perhaps this could be adapted to suit your needs?

Re^3: Matching  & € type characters with a regex
by graff (Chancellor) on Feb 13, 2009 at 15:02 UTC
    The "fffd" characters are the Unicode "replacement character" (U+FFFD). You get it when something converts non-Unicode data into Unicode on the assumption that the data is in some particular character set, and the process comes across a byte value or byte sequence that shouldn't exist in that character set (and so cannot be mapped to a meaningful Unicode character).
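
As a rough illustration of how U+FFFD can appear, here is a minimal sketch using Perl's core Encode module. It assumes, purely for illustration, that the stray byte was 0x92 (the Windows-1252 right single quote); nothing in this thread says what the original bytes actually were. The helper name codepoints is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption for illustration only: a Windows-1252 byte 0x92 after "Course".
my $bytes = "Course\x92";

# Decoding as UTF-8 without a CHECK argument substitutes U+FFFD
# for the malformed byte instead of dying.
my $as_utf8 = decode('UTF-8', $bytes);

# Decoding with the character set the data was really in maps 0x92 to U+2019.
my $as_cp1252 = decode('cp1252', $bytes);

# Hypothetical helper: show a string as a list of code points.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $s);
}

print "as UTF-8:  ", codepoints($as_utf8), "\n";
print "as cp1252: ", codepoints($as_cp1252), "\n";
```

If the cp1252 guess happens to match the real source data, decoding with the right character set up front avoids the replacement character altogether.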

    In this particular case, the data going into tlu was utf8 (based on the correct rendering of the "right-single-quote" and the symbol following "Windows"), but the characters after "Course" and "Syllabus" were already messed up before going into tlu.

    For things that aren't messed up, you either s/widechar/asciichar/g; (e.g. s/\x{2019}/'/g) or you tr/widechar//d (i.e. get rid of them). For fffd, probably best just get rid of it, but maybe figure out what put it there in the first place.
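
Put together, the two suggestions above might look like the following minimal sketch. The sub name clean_text is hypothetical, and the input is the three lines from the question joined into one string; it assumes the text has already been decoded into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical helper: map one wide character to ASCII and drop U+FFFD.
sub clean_text {
    my ($text) = @_;
    $text =~ s/\x{2019}/'/g;    # right single quote -> ASCII apostrophe
    $text =~ tr/\x{fffd}//d;    # delete Unicode replacement characters
    return $text;
}

my $in  = "Course\x{fffd} Syllabus\x{fffd} Operator\x{2019}s Manual Windows\x{00ae} Version 5.2";
my $out = clean_text($in);

# Encode on output so the registered sign (U+00AE) is written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$out\n";
```

Because only U+2019 and U+FFFD are touched, the registered sign after "Windows" survives, which the hex-conversion approach would have lost.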