
Here's a simple one-liner for checking the distribution of byte values in any given data stream or (set of) file(s) -- I'm using quoting that assumes a bash shell:

perl -ne '$c[$_]++ for (unpack("C*", $_));
 END{printf( "%10d %02x\n",$c[$_], $_ ) for (0..255)}'

You can either prefix that with cat * | (where * would match one or more files of interest), or append one or more file names of interest after the close quote. As indicated in the END block, the output will be a list of 256 lines, with two tokens per line:

 (# of bytes) (byte value)

where "byte value" (2nd column) ranges from 00 to ff, and the first column tells you how often the given byte value occurs in the data. If it's really 7-bit ASCII text, then all the byte values from "80" through "ff" will have zeros in front of them.
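The same tally can be done inside a script, which makes the 7-bit check explicit. This is only a minimal sketch: the sub names byte_counts and high_bytes are made up for illustration, and it assumes the data is already in memory as a string of raw bytes.

```perl
use strict;
use warnings;

# Tally byte values in a string of raw bytes; returns a ref to a
# 256-element array indexed by byte value (hypothetical helper name).
sub byte_counts {
    my ($data) = @_;
    my @c = (0) x 256;
    $c[$_]++ for unpack( "C*", $data );
    return \@c;
}

# Total number of bytes with the high bit set (80-ff).
sub high_bytes {
    my ($c) = @_;
    my $n = 0;
    $n += $c->[$_] for 0x80 .. 0xff;
    return $n;
}

my $c = byte_counts("plain text\n");
print high_bytes($c) ? "not 7-bit clean\n" : "7-bit ASCII\n";
```

For real files you would read the data in binary mode (binmode) before handing it to byte_counts, so that no I/O layer changes the bytes.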

With a little practice on different types of files, it's easy to notice patterns that distinguish various types of data. For example, UTF-16 with lots of characters in the 0000-00FF range is easy to spot, because about half the data shows up as null bytes (00). UTF-8 will have various patterns depending on the language of the text, but something the alphabetic languages (Latin, Cyrillic, Greek, Arabic) have in common is one or two byte values in the c0-ff range showing up a lot, plus a similar quantity of values spread out in the 80-bf range.

Single-byte encodings (cp125*, iso-8859-*) are likewise distinctive -- they all have a sparse scattering in the a0-ff range (except Arabic, which is mostly in that range); but cp125* uses 80-9f as well, where iso-8859-* does not. You can also see quickly whether there are carriage returns in the data (0d), and if so, whether they match the quantity of line feeds (0a). If the data is supposed to be a tab-delimited table, you can check whether the number of line feeds divides evenly into the number of tabs (09), i.e. whether the tab count is an exact multiple of the line count, and so on.
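Those last two checks are easy to automate once you have the 256 counts. A rough sketch follows; the sub name line_checks and the shape of its return value are my own invention, and it assumes every line, including the last, ends with a line feed.

```perl
use strict;
use warnings;

# Summarize line-ending and tab counts from a 256-element tally
# (array ref indexed by byte value). Sub name and return shape are
# illustrative assumptions.
sub line_checks {
    my ($c) = @_;
    my ( $tab, $lf, $cr ) = map { $c->[$_] // 0 } 0x09, 0x0a, 0x0d;
    return {
        # true when there are exactly as many CRs as LFs
        crlf_match => ( $cr == $lf ) ? 1 : 0,
        # tabs per line, or undef when tabs are not an exact multiple of LFs
        tabs_per_line => ( $lf && $tab % $lf == 0 ) ? $tab / $lf : undef,
    };
}

# Small in-memory demo: two CRLF-terminated rows of three fields each.
my @c = (0) x 256;
$c[$_]++ for unpack( "C*", "a\tb\tc\r\nd\te\tf\r\n" );
my $r = line_checks( \@c );
printf "CR == LF: %s, tabs per line: %s\n",
    $r->{crlf_match} ? "yes" : "no",
    $r->{tabs_per_line} // "uneven";
```

Matching totals are only a quick sanity check: they don't prove that every row has the same number of fields, just that the counts are consistent with it.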

If you're going to use this sort of diagnostic a lot (I certainly do), it'll be worthwhile turning it into a general utility script so you can spruce it up a bit -- handle command-line options to allow printing as a 16x16 grid instead of 256 lines; optionally print summaries (how many bytes in the 80-ff range, how many in the a0-ff range, how many white-space, etc.).
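As a starting point, here is one way the grid and summary pieces might look. It's a sketch, not a finished utility: the sub names grid and range_total are hypothetical, the column width is arbitrary, and option handling (e.g. with Getopt::Long) is left out.

```perl
use strict;
use warnings;

# Format a 256-element tally (array ref) as 16 rows of 16 counts;
# each row is labelled with the hex value of its first byte.
sub grid {
    my ($c) = @_;
    my $out = "";
    for my $row ( 0 .. 15 ) {
        my @cells = map { $c->[ $row * 16 + $_ ] // 0 } 0 .. 15;
        $out .= sprintf( "%02x:" . ( " %7d" x 16 ) . "\n", $row * 16, @cells );
    }
    return $out;
}

# Sum the counts over an inclusive range of byte values.
sub range_total {
    my ( $c, $lo, $hi ) = @_;
    my $n = 0;
    $n += $c->[$_] // 0 for $lo .. $hi;
    return $n;
}

# Demo on a short UTF-8 byte string held in memory.
my @c = (0) x 256;
$c[$_]++ for unpack( "C*", "caf\xc3\xa9 \xe2\x82\xac\n" );
print grid( \@c );
printf "80-ff: %d  a0-ff: %d  whitespace: %d\n",
    range_total( \@c, 0x80, 0xff ),
    range_total( \@c, 0xa0, 0xff ),
    range_total( \@c, 0x09, 0x0d ) + $c[0x20];
```

Here "whitespace" is taken to mean bytes 09-0d plus the space (20); adjust that to taste.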

In reply to Re: unknown encoding by graff
in thread unknown encoding by jimw54321
