Re: unknown encoding

Here's a simple one-liner for checking the distribution of byte values in any given data stream or (set of) file(s) -- I'm using quoting that assumes a bash shell:

perl -ne '$c[$_]++ for (unpack("C*"));
 END{printf( "%10d %02x\n",$c[$_], $_ ) for (0..255)}'
[download]

You can either prefix that with cat * | (where * would match one or more files of interest), or append one or more file names of interest after the close quote. As indicated in the END block, the output will be a list of 256 lines, with two tokens per line:

 (# of bytes) (byte value)
[download]

where "byte value" (2nd column) ranges from 00 to ff, and the first column tells you how often the given byte value occurs in the data. If it's really 7-bit ascii text, then all the byte values from "80" through "ff" will have zeros in front of them.

With a little practice on different types of files, it's easy to notice patterns that distinguish various types of data -- e.g. UTF-16 with lots of characters in the 0000-00FF range is easy to spot due to having about half the data showing up as null bytes (00); UTF-8 will have various patterns depending on the language of the text, but something the alphabetic languages (Latin, Cyrillic, Greek, Arabic) have in common is one or two byte values in the c0-ff range showing up a lot, plus a similar quantity of values spread out in the 80-bf range.

Single-byte encodings (cp125*, iso-8859-*) are likewise distinctive -- they all have a sparse scattering in the a0-ff range (except Arabic, which is mostly in that range); but cp125* uses 80-9f as well, where iso-8859-* does not. You can also see quickly whether there are carriage returns in the data (0d), and if so, whether they match the quantity of line feeds (0a). If the data is supposed to be a tab-delimited table, you can check whether the number of tabs (09) divides evenly into the number of line feeds, and so on.

If you're going to use this sort of diagnostic a lot (I certainly do), it'll be worth while turning it into a general utility script so you can spruce it up a bit -- handle command-line options to allow printing as a 16x16 grid instead of 256 lines; optionally print summaries (how many bytes in the 80-ff range, how many in the a0-ff range, how many white-space, etc).

Comment on Re: unknown encoding Select or Download Code