in reply to How can I tell if a string contains binary data or plain-old text?

There is no single, simple answer to this question. In one sense, "plain-old text" is just arbitrary binary data unless you happen to know which human language it is written in, and are reasonably sure that it represents correct usage in that language: few or no typos, at most occasional words quoted or borrowed from some other language, no line noise or other corruption, and so on. If the text is in a language that uses characters beyond 7-bit ASCII, the distinction between "text" and "not text" can be slippery.

One general approach is to develop a statistical model of what you consider to be "text". Text data in any human language will have a fairly distinctive distribution of byte values when compared to any non-linguistic data stream (including text that has been compressed, encrypted, and/or encoded via base64, uuencode, etc.) -- or when compared to some other language, or when compared to data in the same language under a different character encoding (e.g. CP437 vs. Latin1 vs. UTF-16).
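To make "distribution of byte values" concrete, here is a minimal Python sketch (Python is my choice for illustration; the function names `byte_distribution` and `top_bytes` and the sample string are made up for this example). It tabulates the relative frequency of each of the 256 byte values, once for a short English sentence and once for the same sentence after base64 encoding.

```python
import base64
from collections import Counter

def byte_distribution(data: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)          # iterating over bytes yields ints 0..255
    total = len(data) or 1          # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]

def top_bytes(data: bytes, n: int = 5) -> list[tuple[bytes, int]]:
    """Return the n most common byte values with their raw counts."""
    return [(bytes([b]), c) for b, c in Counter(data).most_common(n)]

# Illustrative sample only; a real model needs far more data than this.
SAMPLE_TEXT = (b"Text data in any human language will have a fairly "
               b"distinctive distribution of byte values.")
SAMPLE_B64 = base64.b64encode(SAMPLE_TEXT)

TEXT_DIST = byte_distribution(SAMPLE_TEXT)
B64_DIST = byte_distribution(SAMPLE_B64)

if __name__ == "__main__":
    print("text  :", top_bytes(SAMPLE_TEXT))
    print("base64:", top_bytes(SAMPLE_B64))
```

Even on this tiny sample the two profiles differ in an obvious way: the space character is common in the plain sentence and never appears in the base64 output.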

That is, the relative probabilities of the 256 different byte values will be quite distinctive for a given language, using a given character encoding. Of course, there are limitations: classification is less reliable on short strings (but any test case of more than 60 bytes should be pretty robust); you need to have enough valid text data to build a decent model; and if you need to recognize "plain text" in different languages, or using different character encodings, you need a separate model for each type of "target" you want to recognize. It also helps if you can build a relevant model of the "non-text" data you are likely to encounter. (If your model is based on bigrams -- i.e. the probabilities of byte pairs -- it can be much more powerful and accurate, but then you have 64K (65,536) probabilities to keep track of, instead of 256.)
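One way the idea above could be turned into a classifier, sketched in Python under my own assumptions: train one byte-unigram model on known text and one on representative "non-text", then pick whichever model gives the test string the higher log-likelihood. The add-one smoothing, the tiny training strings, the flat stand-in for non-text data, and all the names (`train_unigram`, `log_likelihood`, `looks_like_text`) are illustrative choices, not something prescribed above.

```python
import math
from collections import Counter

def train_unigram(samples, alpha=1.0):
    """Return 256 log-probabilities, one per byte value, with add-alpha smoothing."""
    counts = Counter()
    for s in samples:
        counts.update(s)                      # counts individual byte values
    total = sum(counts.values()) + alpha * 256
    return [math.log((counts.get(b, 0) + alpha) / total) for b in range(256)]

def log_likelihood(data, model):
    """Sum of per-byte log-probabilities of data under the given model."""
    return sum(model[b] for b in data)

def looks_like_text(data, text_model, other_model):
    """True if data is more probable under text_model than under other_model."""
    return log_likelihood(data, text_model) > log_likelihood(data, other_model)

# Toy training data (assumption): a real model needs far more text than this.
TEXT_SAMPLES = [
    b"There is no single, simple answer to this question.",
    b"The quick brown fox jumps over the lazy dog.",
    b"Text data in any human language will have a fairly distinctive "
    b"distribution of byte values.",
    b"Classification is less reliable on short strings.",
]
# Stand-in for "non-text" (assumption): every byte value equally likely.
OTHER_SAMPLES = [bytes(range(256)) * 4]

TEXT_MODEL = train_unigram(TEXT_SAMPLES)
OTHER_MODEL = train_unigram(OTHER_SAMPLES)

if __name__ == "__main__":
    for candidate in (b"This is another plain sentence that the model has not seen.",
                      bytes(range(128, 256))):
        print(candidate[:20], looks_like_text(candidate, TEXT_MODEL, OTHER_MODEL))
```

A bigram version would count byte pairs instead of single bytes, which is where the 65,536 probabilities come from. To handle several languages or encodings, you would train one text model per target and compare the test string against each of them.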

Maybe this is not the sort of answer you were looking for? In any case, statistical classification methods are expected to be wrong some percentage of the time (both false positives and false negatives), and the vagaries of "text data" can often pose difficult boundary cases, like strings that contain some text, and some stuff that isn't text (e.g. the kind of crap you find in M$ Word "doc" files).
