
There is no single, simple answer to this question. In one sense, "plain-old text" is arbitrary binary data unless you happen to know the human language it is written in and are reasonably sure that it represents correct usage in that language, with few or no typos, words quoted or borrowed from some other language, line noise, or other sorts of corruption. If the text is in a language that uses characters beyond 7-bit ASCII, the distinction between "text" and "not text" can be slippery.

One general approach is to develop a statistical model of what you consider to be "text". Text data in any human language will have a fairly distinctive distribution of byte values when compared to any non-linguistic data stream (including text that has been compressed, encrypted, and/or encoded via base64, uuencode, etc.) -- or when compared to some other language, or when compared to data in the same language when some alternate character encoding is used (e.g. CP437 vs. Latin1 vs. UTF-16).
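To make "distinctive distribution of byte values" concrete, here is a minimal sketch in Python. It is not from the original post: the function names `byte_distribution` and `entropy_bits`, the sample sentence, and the choice of zlib as the "non-linguistic" comparison are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter

def byte_distribution(data: bytes) -> dict:
    """Relative frequency of each byte value that occurs in data."""
    counts = Counter(data)
    total = len(data)
    return {value: n / total for value, n in counts.items()}

def entropy_bits(data: bytes) -> float:
    """Shannon entropy of the observed byte distribution, in bits per byte."""
    return -sum(p * math.log2(p) for p in byte_distribution(data).values())

# Illustrative ASCII sample (an assumption, not data from the post).
sample = (
    b"Plain text in a human language reuses a small set of byte values. "
    b"Spaces, vowels and a few common consonants account for most of it, "
    b"while control characters and high-bit bytes are rare or absent."
)
compressed = zlib.compress(sample)

for label, data in (("text", sample), ("zlib", compressed)):
    high_bit = sum(1 for b in data if b >= 0x80) / len(data)
    top = sorted(byte_distribution(data).items(), key=lambda kv: -kv[1])[:3]
    print(label, f"entropy={entropy_bits(data):.2f}",
          f"high-bit fraction={high_bit:.2f}", "top bytes:", top)
```

On a sample this short the measured entropy is capped by the sample length, so the fraction of high-bit bytes and the shape of the "top bytes" list are the more telling comparison here.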

That is, the relative probabilities of the 256 different byte values will be quite distinctive for a given language, using a given character encoding. Of course, there are limitations: classification is less reliable on short strings (but any test case of more than 60 bytes should be pretty robust); you need to have enough valid text data to build a decent model; and if you need to recognize "plain text" in different languages, or using different character encodings, you need a separate model for each type of "target" you want to recognize. It also helps if you can build a relevant model of the "non-text" data you are likely to encounter. (If your model is based on bigrams -- i.e. the probabilities of byte pairs -- it can be much more powerful and accurate, but then you have 64K (65,536) probabilities to keep track of, instead of 256.)
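One way such a model could be sketched in Python is shown below. This is an illustration under stated assumptions, not the method of any particular module: the names `train_byte_model`, `avg_log_likelihood` and `looks_like_text` are hypothetical, the training text is a toy sample, add-one smoothing is my choice, and the "non-text" model is assumed to be a flat distribution over all 256 byte values.

```python
import math
from collections import Counter

def train_byte_model(training: bytes, smoothing: float = 1.0) -> list:
    """Log-probability of each of the 256 byte values, with additive smoothing
    so that byte values never seen in training still get a small probability."""
    counts = Counter(training)
    total = len(training) + 256 * smoothing
    return [math.log((counts.get(b, 0) + smoothing) / total) for b in range(256)]

def avg_log_likelihood(model: list, data: bytes) -> float:
    """Average log-probability per byte of data under the model."""
    return sum(model[b] for b in data) / len(data)

# Assumed "non-text" model: every byte value equally likely.
UNIFORM_LOG_PROB = math.log(1 / 256)

def looks_like_text(model: list, data: bytes) -> bool:
    """True if the text model explains data better than the uniform model."""
    if not data:
        return False
    return avg_log_likelihood(model, data) > UNIFORM_LOG_PROB

# Toy training text (an assumption); a usable model needs far more data.
TRAINING = (
    b"Plain text in a human language reuses a small set of byte values. "
    b"Spaces, vowels and a few common consonants account for most of the data, "
    b"while control characters and high-bit bytes are rare or absent. "
    b"A model trained on enough of this text learns those proportions."
)
model = train_byte_model(TRAINING)

print(looks_like_text(model, b"this is an ordinary sentence of text"))
print(looks_like_text(model, bytes(range(128, 256))))
```

A bigram variant would follow the same pattern with a table indexed by byte pairs (65,536 entries), and a model trained on real samples of your "non-text" data could replace the uniform baseline.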

Maybe this is not the sort of answer you were looking for? In any case, statistical classification methods are expected to be wrong some percentage of the time (both false positives and false negatives), and the vagaries of "text data" can often pose difficult boundary cases, like strings that contain some text, and some stuff that isn't text (e.g. the kind of crap you find in M$ Word "doc" files).

In reply to Re: How can I tell if a string contains binary data or plain-old text? by graff
in thread How can I tell if a string contains binary data or plain-old text? by Anonymous Monk
