in reply to Matching  & € type characters with a regex

Just for grins, download the script I posted here a while ago: tlu -- TransLiterate Unicode. Run your "funky" data through that script and see what comes out.

If you see stuff that looks like \x{02BC} or \x{2019} then what you have is utf8 text data with some "wide" characters in it, and your initial problem, as explained by ikegami, is that you aren't looking at it the right way or using the right tools to view it. The "tlu" script converts wide characters into their "literal" hex-numeric code-point form, using perl syntax by default.
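
To make that concrete, here is a minimal sketch of the kind of conversion described above. It is not the actual tlu script; `show_wide` is a hypothetical helper, and it assumes the text has already been decoded into perl characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real tlu script): replace every character
# above U+007F with its perl-style \x{XXXX} code-point escape.
sub show_wide {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/sprintf('\\x{%04x}', ord $1)/ge;
    return $text;
}

# The sample is built from code points, so it is already character data;
# with real input you would decode the bytes as UTF-8 first.
my $sample = "Operator\x{2019}s Manual Windows\x{00ae} Version 5.2";
print show_wide($sample), "\n";
```

Because every non-ASCII character is escaped, the output is plain ASCII and can be viewed safely in any terminal or editor.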

Some of your wide characters will have ascii and (single-byte) Latin-1 equivalents (e.g. the apostrophe or right-single-quote mark or the copyright symbol), but some might not. By reading the data as utf8 (the way it's supposed to be read), there are lots of ways in perl to easily fix or remove them as you see fit.

Re^2: Matching  & € type characters with a regex
by Rodster001 (Pilgrim) on Feb 13, 2009 at 06:31 UTC
    So, running those three lines through your script I got exactly what you expected:
    Course\x{fffd} Syllabus\x{fffd} Operator\x{2019}s Manual Windows\x{00ae} Version 5.2
    Now, I can think of several ways to go from here. My converting to hex (in my update above) worked but wasn't really ideal (the (R) mark would have been tossed).

    If your goal was to convert those three lines into this:

    Course Syllabus Operator's Manual Windows® Version 5.2
    What would you do next?

    Thanks for the help!

      There's a discussion on normalising accents in The Björk Situation.

      I use thundergnat's approach but note the limitations pointed out by other monks.

      Perhaps this could be adapted to suit your needs?

      The "fffd" characters are the Unicode "replacement character" (U+FFFD). You get it when something converts non-Unicode data into Unicode on the assumption that the data is in some particular character set, and the process comes across a byte value or byte sequence that shouldn't exist in that character set (and so cannot be mapped to a meaningful Unicode character).
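
      A small sketch of how that can happen, using the core Encode module. The 0xA0 byte is an assumed example; nothing in this thread says which byte was actually in the original data.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);

      # "Course" followed by a lone 0xA0 byte: a no-break space in Latin-1,
      # but not a valid byte sequence in UTF-8. (Assumed example input.)
      my $bytes = "Course\xA0Syllabus";

      # With the default CHECK behaviour, decode() substitutes U+FFFD
      # for each malformed sequence instead of dying.
      my $as_utf8   = decode('UTF-8', $bytes);
      my $as_latin1 = decode('ISO-8859-1', $bytes);

      printf "as UTF-8:   U+%04X\n", ord substr($as_utf8, 6, 1);    # U+FFFD
      printf "as Latin-1: U+%04X\n", ord substr($as_latin1, 6, 1);  # U+00A0
      ```

      Once the byte has been replaced by U+FFFD, the original value is gone, which is why it pays to find the step that made the wrong assumption.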

      In this particular case, the data going into tlu was utf8 (based on the correct rendering of the "right-single-quote" and the symbol following "Windows"), but the characters after "Course" and "Syllabus" were already messed up before going into tlu.

      For things that aren't messed up, you either s/widechar/asciichar/g; (e.g. s/\x{2019}/'/g) or you tr/widechar//d (i.e. get rid of them). For fffd, it's probably best to just get rid of it, but maybe figure out what put it there in the first place.
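
      Put together, a minimal sketch might look like this. `clean_line` is a hypothetical helper, and the sample string is typed in as code points rather than read from a file; real data would need to be read through a utf8 input layer first.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: map the wide characters seen in this thread
      # and drop the replacement character.
      sub clean_line {
          my ($line) = @_;
          $line =~ s/\x{2019}/'/g;    # right single quote -> ASCII apostrophe
          $line =~ tr/\x{fffd}//d;    # delete U+FFFD
          return $line;
      }

      # U+00AE is kept, so encode it properly on the way out.
      binmode STDOUT, ':encoding(UTF-8)';

      my $line = "Course\x{fffd} Syllabus\x{fffd} Operator\x{2019}s Manual"
               . " Windows\x{00ae} Version 5.2";
      print clean_line($line), "\n";
      ```

      For data in a file, opening it with open(my $fh, '<:encoding(UTF-8)', $path) gives you decoded characters, so the same substitutions apply line by line.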