in reply to Is utf8, ascii ?
If the file data is already in utf8, you should be able to do

    unichist -x file.name

and that would show you all the distinct Unicode characters in the file, one per line (with frequency of occurrence and hex code-point value for each character).
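
For anyone who wants to see the idea without the script in hand, here is a minimal sketch of that kind of character histogram. It is not the unichist source; the helper names `char_histogram` and `print_histogram` are made up for illustration, and it works on a string that has already been decoded to Perl characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of unichist): count how often each
# distinct character occurs in an already-decoded string.
sub char_histogram {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split //, $text;
    return \%count;
}

# Print one line per distinct character: hex code point, then frequency.
# Only the code point is printed, so no output encoding layer is needed.
sub print_histogram {
    my ($hist) = @_;
    for my $ch (sort { ord($a) <=> ord($b) } keys %$hist) {
        printf "%04X\t%d\n", ord($ch), $hist->{$ch};
    }
}

# Sample text given as code points so the source file stays pure ASCII.
my $hist = char_histogram("caf\x{e9} \x{65e5}\x{672c}\x{65e5}");
print_histogram($hist);
```

To run this over a real file, you would read it through a decoding layer first, for example `open my $fh, '<:encoding(UTF-8)', $path`, and feed each line to the counter.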
But if you see lots of "Malformed UTF-8" messages, the data is encoded in some other (non-Unicode) character set. You can use a command line option to try different encodings on input until you hit on the one that works for your data (the script uses Encode to apply input decoding if the "-r enc" option is given):

    unichist -x -r euc-jp file.name
    ... # if you see errors or lots of "FFFD" characters, you guessed wrong
    unichist -x -r shiftjis file.name
    ...

The Encode man page tells how to get a listing of available character sets (or you can look at yet another tool I posted -- grepp -- Perl version of grep -- to see how to list the encodings).
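
The same trial-and-error can be sketched directly with Encode. This is only an illustration under a few assumptions: `plausible_encodings` is a hypothetical helper name, the candidate list is arbitrary, and a clean decode is a hint rather than proof, since more than one encoding can accept the same bytes. The last line shows the `Encode->encodings(':all')` call for listing the available character sets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the candidate encodings under which
# $bytes decodes without any malformed-input error.
sub plausible_encodings {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $enc (@candidates) {
        # FB_CROAK makes decode() die on malformed input instead of
        # substituting U+FFFD; LEAVE_SRC keeps $bytes intact so the
        # next candidate sees the same data.
        my $text = eval {
            decode($enc, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        push @ok, $enc if defined $text;
    }
    return @ok;
}

# Sample data: three kanji (U+65E5 U+672C U+8A9E) stored as EUC-JP bytes.
my $sample = encode('euc-jp', "\x{65e5}\x{672c}\x{8a9e}");
print join(', ', plausible_encodings($sample, qw(UTF-8 euc-jp shiftjis))), "\n";

# Every encoding name that Encode knows about, one per line.
print "$_\n" for Encode->encodings(':all');
```

If several candidates survive, looking at the decoded characters (or their histogram) is still the way to decide which one is actually right.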