Re: Matching Ā & € type characters with a regex

Just for grins, download the script I posted here a while ago: tlu -- TransLiterate Unicode. Run your "funky" data through that script and see what comes out.

If you see stuff that looks like \x{02BC} or \x{2019} then what you have is utf8 text data with some "wide" characters in it, and your initial problem, as explained by ikegami, is that you aren't looking at it the right way or using the right tools to view it. The "tlu" script converts wide characters into their "literal" hex-numeric code-point form, using perl syntax by default.
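To give a rough idea of what that kind of output involves, here is a minimal sketch. It is not the actual tlu code; `show_wide` is a hypothetical helper name, and the sample bytes are made up for illustration. It decodes utf8 bytes into characters and then prints every non-ASCII character as a perl-style `\x{XXXX}` escape.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (NOT the real tlu script): replace every
# non-ASCII character with its perl-style \x{XXXX} code-point escape.
sub show_wide {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/sprintf('\\x{%04x}', ord $1)/ge;
    return $text;
}

# Assumed sample input: raw utf8 bytes, where E2 80 99 encodes U+2019
# (right single quotation mark).
my $bytes = "Operator\xE2\x80\x99s Manual";

# Decode bytes into characters first; without this step perl would
# see three separate single-byte characters instead of one wide one.
my $chars = decode('UTF-8', $bytes);

print show_wide($chars), "\n";    # prints: Operator\x{2019}s Manual
```

If the decode step is skipped, the same data shows up as `\x{00e2}\x{0080}\x{0099}`, which is a quick way to tell that the data was read as bytes rather than as utf8.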

Some of your wide characters will have ascii and (single-byte) Latin-1 equivalents (e.g. the apostrophe or right-single-quote mark or the copyright symbol), but some might not. By reading the data as utf8 (the way it's supposed to be read), there are lots of ways in perl to easily fix or remove them as you see fit.


Re^2: Matching Ā & € type characters with a regex by Rodster001 (Pilgrim) on Feb 13, 2009 at 06:31 UTC
So, running those three lines through your script, I got exactly what you expected: `Course\x{fffd} Syllabus\x{fffd} Operator\x{2019}s Manual Windows\x{00ae} Version 5.2` Now, I can think of several ways to go from here. Converting to hex (in my update above) worked but wasn't ideal (the (R) mark would have been tossed). If your goal was to convert those three lines into this: `Course Syllabus Operator's Manual Windows® Version 5.2` what would you do next? Thanks for the help!
Re^3: Matching Ā & € type characters with a regex by wfsp (Abbot) on Feb 13, 2009 at 10:20 UTC
There's a discussion on normalising accents in The Björk Situation. I use thundergnat's approach, but note the limitations pointed out by other monks. Perhaps this could be adapted to suit your needs?
Re^3: Matching Ā & € type characters with a regex by graff (Chancellor) on Feb 13, 2009 at 15:02 UTC
The "fffd" characters are the Unicode "replacement character", which is what you get when something tries to convert non-Unicode data into Unicode, assuming that it's in some particular character set, and the process comes across a byte value or byte sequence that shouldn't exist in that character set (and so cannot be mapped to a meaningful Unicode character). In this particular case, the data going into tlu was utf8 (based on the correct rendering of the "right-single-quote" and the symbol following "Windows"), but the characters after "Course" and "Syllabus" were already messed up before going into tlu. For things that aren't messed up, you either `s/widechar/asciichar/g;` (e.g. `s/\x{2019}/'/g`) or you `tr/widechar//d` (i.e. get rid of them). For fffd, it's probably best to just get rid of it, but maybe figure out what put it there in the first place.
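Applied to the sample line from this thread, those two operations could look something like the sketch below. `clean_line` is a hypothetical name, and the input is assumed to be already-decoded character data (not raw bytes); which characters to map and which to drop is a choice you'd make for your own data.

```perl
use strict;
use warnings;

# Hypothetical helper: map one wide character to its ASCII
# equivalent and delete the Unicode replacement character.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\x{2019}/'/g;    # right single quote -> ASCII apostrophe
    $line =~ tr/\x{fffd}//d;    # drop U+FFFD replacement characters
    return $line;
}

# Sample line as decoded characters, mirroring the tlu output above.
my $line = "Course\x{fffd} Syllabus\x{fffd} "
         . "Operator\x{2019}s Manual Windows\x{00ae} Version 5.2";

# Encode on output so the (R) sign, which is kept, prints as valid utf8.
binmode STDOUT, ':encoding(UTF-8)';
print clean_line($line), "\n";
```

The (R) sign (U+00AE) is left alone here because it has a single-byte Latin-1 equivalent; if the output has to be pure ASCII, another `s///` could turn it into something like `(R)`.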