Try running this script on your data: unichist -- count/summarize characters in data. It prints a list of all the distinct code points in the input and how many times each one occurs. It expects UTF-8 text by default, but if your data is in some other encoding, you can specify that with a command-line option ("--enc=..."); the output is always given in terms of Unicode code points.
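If you just want to see the idea, here is a rough sketch of the same kind of tally. This is not the unichist script itself, only a small Python illustration of counting code points; the function name `codepoint_histogram` and its `encoding` parameter are made up for this example, and the real script's output format may well differ.

```python
import collections
import unicodedata


def codepoint_histogram(data, encoding="utf-8"):
    """Return (code point, count, name) rows for every distinct character.

    `data` is raw bytes; `encoding` is a hypothetical stand-in for the
    kind of encoding option described above.
    """
    # Decode first, so that counting happens on code points, not bytes.
    text = data.decode(encoding)
    counts = collections.Counter(text)

    rows = []
    # Sort by code point so the listing is stable and easy to scan.
    for ch, n in sorted(counts.items()):
        # Control characters have no Unicode name, so supply a fallback.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(("U+%04X" % ord(ch), n, name))
    return rows


# Small demonstration on an in-memory sample instead of a real file.
for cp, n, name in codepoint_histogram("caf\u00e9 na\u00efve".encode("utf-8")):
    print(cp, n, name, sep="\t")
```

To run it on a real file, read the file in binary mode and pass the bytes in, along with whatever encoding the data actually uses. Anything outside the range you expected (stray Latin-1 accents, smart quotes, control characters) stands out quickly in a listing like this, and that tells you what your regex needs to match.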
In reply to Re: regexing for non-standard characters...
by graff
in thread regexing for non-standard characters...
by emmiesix