Re: What data is code?

One possibility is to check character frequencies. Your semicolon idea sounds good: if the number of semicolons is comparable to the number of newlines (and "\n" isn't encoded), chances are it's code. If newlines get encoded too, lines will be unusually long.
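Here is a rough sketch of that semicolon check in Python. The function names and the `low`/`high` cutoffs are my own assumptions, picked for illustration and not tuned against real data.

```python
def semicolon_newline_ratio(text: str) -> float:
    """Return semicolons per newline; 0.0 if there is neither."""
    semicolons = text.count(";")
    newlines = text.count("\n")
    if newlines == 0:
        # No line breaks at all: newlines may be encoded, so the ratio is not meaningful.
        return float("inf") if semicolons else 0.0
    return semicolons / newlines


def looks_like_code(text: str, low: float = 0.3, high: float = 3.0) -> bool:
    """Guess 'code' when semicolons and newlines are of comparable number.

    The low/high bounds are hypothetical thresholds, not measured values.
    """
    return low <= semicolon_newline_ratio(text) <= high
```

Something like `looks_like_code(open(path).read())` would be the typical call. A ratio of infinity (semicolons but no newlines) is the case where you would fall back to looking at line length instead.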

The primary single measure of language-like structure is Friedman's kappa test, which is an index of coincidence. Scan the text, comparing each character with the character a fixed distance ahead, and increment a count whenever the two are equal. Score that count as a fraction of the comparisons made. Random or well-encoded text will score about 1/(alphabet size). The redundancy of useful language leads to an index in the range .05-.08 for natural language.
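The scan described above is only a few lines of Python. This is a minimal sketch: the name `kappa` and the `shift` parameter are mine, and it compares whatever characters it is given without folding case or stripping punctuation, which you may want to do first for natural-language text.

```python
def kappa(text: str, shift: int = 1) -> float:
    """Fraction of positions where text[i] equals text[i + shift]."""
    if shift < 1:
        raise ValueError("shift must be a positive integer")
    comparisons = len(text) - shift
    if comparisons <= 0:
        # Text is too short to compare anything at this distance.
        return 0.0
    # Pair each character with the one `shift` places ahead and count matches.
    matches = sum(1 for a, b in zip(text, text[shift:]) if a == b)
    return matches / comparisons
```

For uniformly random bytes the expected score is about 1/256, roughly .004, so a result well above that suggests the data has structure. Trying several values of `shift` and averaging should smooth out noise on short inputs.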

After Compline,
Zaxo
