Re: Is utf8, ascii ?

I've posted a couple of Unicode-related utilities here at the Monastery: "unichist -- count/summarize characters in data" and "tlu -- TransLiterate Unicode". The first one might be enough for you to figure out what sort of data you have in your files.

If the file data is already in utf8, you should be able to do

unichist -x file.name

and that would show you all the distinct Unicode characters in the file, one per line (with the frequency of occurrence and the hex code-point value for each character).

But if you see lots of "Malformed UTF-8" messages, the data is probably encoded in some other (non-Unicode) character set. You can use a command line option to try different encodings on input until you hit on the one that works for your data (the script uses Encode to apply input decoding if the "-r enc" option is given):

unichist -x -r euc-jp  file.name
... # if you see errors or lots of "FFFD" characters, you guessed wrong

unichist -x -r shiftjis file.name
...

The Encode man page explains how to get a listing of available character sets (or you can look at yet another tool I posted, "grepp -- Perl version of grep", to see how to list the encodings).
