in reply to Matching  & € type characters with a regex
If you see stuff that looks like \x{02BC} or \x{2019}, then what you have is utf8 text data with some "wide" characters in it. Your initial problem, as explained by ikegami, is that you aren't looking at it the right way or using the right tools to view it. The "tlu" script converts wide characters into their "literal" hex-numeric code-point form, using perl syntax by default.
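To make that notation concrete, here is a minimal sketch of the same kind of conversion. It is not the actual "tlu" script: the name show_wide is made up for illustration, and I'm assuming that every character above U+007F should be shown as a code point.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT the real "tlu" script): rewrite every
# non-ASCII character as its \x{HHHH} code-point notation.
sub show_wide {
    my ($text) = @_;
    # ord() gives the code point; sprintf formats it as 4+ hex digits.
    $text =~ s/([^\x00-\x7F])/sprintf('\\x{%04X}', ord $1)/ge;
    return $text;
}

# The input must already be decoded (characters, not raw utf8 bytes).
my $sample = "it\x{2019}s";
print show_wide($sample), "\n";    # prints it\x{2019}s
```

If you hand it raw, undecoded utf8 bytes instead, each byte gets reported separately (\x{00E2}\x{0080}\x{0099} rather than \x{2019}), which is a quick way to tell that the data was never decoded.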
Some of your wide characters will have ASCII or (single-byte) Latin-1 equivalents (e.g. the apostrophe, the right-single-quote mark, or the copyright symbol), but some might not. Once you read the data as utf8 (the way it's supposed to be read), perl gives you lots of easy ways to fix or remove them as you see fit.
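Here is one rough way to do that, as a sketch only. The mapping table and the name clean_text are made up for illustration, and dropping everything outside Latin-1 is just one policy you might pick:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical mapping of wide characters to ASCII stand-ins;
# extend it to suit your own data.
my %ascii_for = (
    "\x{02BC}" => "'",    # MODIFIER LETTER APOSTROPHE
    "\x{2019}" => "'",    # RIGHT SINGLE QUOTATION MARK
);

sub clean_text {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);                  # raw bytes -> characters
    $text =~ s/([\x{02BC}\x{2019}])/$ascii_for{$1}/g;    # swap the known characters
    $text =~ s/[^\x00-\xFF]//g;                          # drop anything else outside Latin-1
    return $text;
}

# Raw utf8 bytes for: it's (with U+2019), a copyright sign, "2009", a euro sign.
my $raw = "it\xE2\x80\x99s \xC2\xA9 2009 \xE2\x82\xAC";

binmode STDOUT, ':encoding(UTF-8)';
print clean_text($raw), "\n";
```

When the data comes from a file, opening it with open(my $fh, '<:encoding(UTF-8)', $path) does the decoding for you, so the explicit decode() call isn't needed.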
Re^2: Matching  & € type characters with a regex
by Rodster001 (Pilgrim) on Feb 13, 2009 at 06:31 UTC
by wfsp (Abbot) on Feb 13, 2009 at 10:20 UTC
by graff (Chancellor) on Feb 13, 2009 at 15:02 UTC