in reply to regexing for non-standard characters...
Try running this script on your data: "unichist -- count/summarize characters in data". It lists every distinct code point in the input and how many times each one occurs. It expects UTF-8 text by default, but if your data comes in some other encoding, you can specify that with a command-line option ("--enc=..."); the output is always in terms of Unicode code points.
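To make the idea concrete, here is a minimal sketch of a code point histogram. It is not the unichist script itself, and it is written in Python purely for illustration; the function names (codepoint_histogram, format_histogram) and the convention of passing the encoding as the first argument are assumptions of this sketch, not anything taken from unichist.

```python
import sys
import unicodedata
from collections import Counter


def codepoint_histogram(data, encoding="utf-8"):
    """Decode raw bytes and count how often each code point occurs.

    Hypothetical helper for illustration; raises UnicodeDecodeError
    if the bytes are not valid in the given encoding.
    """
    text = data.decode(encoding)
    return Counter(text)


def format_histogram(counts):
    """Return one tab-separated line per code point, sorted by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        # Control characters and unassigned code points have no name.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}\t{n}\t{name}")
    return lines


if __name__ == "__main__":
    # Assumed usage: the optional first argument names the input encoding.
    enc = sys.argv[1] if len(sys.argv) > 1 else "utf-8"
    raw = sys.stdin.buffer.read()
    print("\n".join(format_histogram(codepoint_histogram(raw, enc))))
```

Saved under a name of your choosing, it reads from standard input, so something like "python3 histogram.py latin-1 < yourfile" would report on Latin-1 data. Any code point you did not expect to see in the listing is a candidate for the "non-standard" characters you are hunting.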
Re^2: regexing for non-standard characters...
by Jim (Curate) on Apr 17, 2010 at 19:45 UTC
by graff (Chancellor) on Apr 18, 2010 at 04:08 UTC
by Jim (Curate) on Apr 19, 2010 at 02:59 UTC
by graff (Chancellor) on Jun 14, 2010 at 00:17 UTC