I've posted a couple of Unicode-related utilities here at the Monastery: unichist -- count/summarize characters in data and tlu -- TransLiterate Unicode. The first one might be enough for you to figure out what sort of data you have in your files.
If the file data is already in utf8, you should be able to do
unichist -x file.name
and that would show you all the distinct unicode characters in the file, one per line (with frequency of occurrence and hex code-point value for each character).
But if you see lots of "Malformed UTF-8" messages, the data is encoded in some other (non-unicode) character set. You can use a command line option to try different encodings on input until you hit on the one that works for your data (the script uses Encode to apply input decoding if the "-r enc" option is given):
unichist -x -r euc-jp file.name
... # if you see errors or lots of "FFFD" characters, you guessed wrong
unichist -x -r shiftjis file.name
...
The Encode man page tells how to get a listing of available character sets (or you can look at yet another tool I posted -- grepp -- Perl version of grep -- to see how to list the encodings).