in reply to How can I tell if a string contains binary data or plain-old text?

You can't do it yourself easily, though there are tricks. If you know that it's either Unicode text OR a JPEG, you can look for the JPEG header and rule JPEG out if the header isn't found. Or, if you're limiting the text to standard ASCII, you can be fairly certain it's text if each byte's value is 127 or less. But that gets blown away if your text is 8-bit MIME or Unicode, or if you're looking at a uuencoded file, which is a non-text entity encoded into 7-bit, text-only characters so that it can travel easily over SMTP. A zipped or tarred file might look like binary data on the surface but could contain a text file within. A uuencoded file will look like text on the outside but may contain binary data within, just as a JPEG looks like binary data on the outside and yet represents an image within.
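To make the two tricks above concrete, here is a minimal sketch in Python. The function names are made up for illustration, and the three-byte signature is only the usual start of a JPEG stream, so treat this as a rough check and not a validator.

```python
import binascii

# Usual start of a JPEG stream: the SOI marker (FF D8) followed by
# the first byte of the next marker (FF).
JPEG_MAGIC = b"\xff\xd8\xff"

def looks_like_jpeg(data: bytes) -> bool:
    """True if the data begins with the usual JPEG signature."""
    return data.startswith(JPEG_MAGIC)

def is_seven_bit(data: bytes) -> bool:
    """True if every byte has a value of 127 or less."""
    return all(byte <= 127 for byte in data)

# uuencoding maps arbitrary bytes onto printable 7-bit characters,
# so clearly binary input passes the 7-bit test once it is encoded.
raw = bytes([0x00, 0xFF, 0x80, 0x7F])
encoded = binascii.b2a_uu(raw)
```

Here `is_seven_bit(raw)` is false while `is_seven_bit(encoded)` is true, which is exactly the uuencode caveat: the 7-bit test only tells you about the outer layer.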

The problem is that the more variants of "plain old text" you consider to be plain old text, the more difficult it becomes to distinguish it from non-text.

That being the case, you can guess based on various criteria.


Dave


"If I had my life to live over again, I'd be a plumber." -- Albert Einstein

Re: Re: How can I tell if a string contains binary data or plain-old text?
by Anonymous Monk on Oct 31, 2003 at 04:22 UTC
    Slightly better than excluding characters over 127 is excluding characters from 1 to 31 inclusive, since those aren't used in any single-byte, 8-bit encodings. They also aren't used as the first bytes in the variable-length encodings, although this requires parsing the symbols to figure out which are the first bytes.
    Of course a few control characters will occur legitimately in text strings (e.g. EOF), but the percentage will be tiny compared to the ~12.5% you expect in most binaries.
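A rough sketch of that percentage idea in Python, assuming the whole string is already in memory as bytes. The name `control_fraction` is hypothetical. For uniformly random bytes the expected share in the range 1 to 31 is 31/256, a little over 12%.

```python
def control_fraction(data: bytes) -> float:
    """Fraction of bytes whose value lies in 1..31 inclusive."""
    if not data:
        return 0.0  # nothing to measure in an empty string
    hits = sum(1 for byte in data if 1 <= byte <= 31)
    return hits / len(data)
```

Ordinary text should score close to zero (newlines and tabs aside), while compressed or otherwise random-looking data should land near the 12% mark. Where to draw the line between the two is a judgment call.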
      There's no EOF character in the ASCII set. There might be some filesystems that require files to use a particular character to signal the end of a file (for instance, the SUB (aka ^Z) character has been used), but most modern filesystems record the size of the file as metadata (on Unix-style filesystems, in the inode) and don't need a certain character to be present.

      However, some characters in the range 00-1F are found in text files: carriage returns (^M), line feeds (^J), tabs (^I), bells (^G), form feeds (^L) and backspaces (^H). Theoretically, one could find vertical tabs (^K) in text files as well, but I've never knowingly encountered such a thing in a text file.
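One way to fold that list into a check, sketched in Python: whitelist the control characters named above and count only the remaining bytes in 00-1F as suspicious. The names and the 1% threshold are assumptions chosen for illustration, not established values.

```python
# Control characters that do turn up in text files:
# bell, backspace, tab, line feed, form feed, carriage return.
TEXT_CONTROLS = frozenset(b"\a\b\t\n\f\r")

def suspicious_count(data: bytes) -> int:
    """Number of bytes in 00-1F that are not common text controls."""
    return sum(1 for byte in data
               if byte <= 0x1F and byte not in TEXT_CONTROLS)

def looks_like_text(data: bytes, max_fraction: float = 0.01) -> bool:
    """Guess 'text' when suspicious bytes are rare (threshold is arbitrary)."""
    if not data:
        return True
    return suspicious_count(data) / len(data) <= max_fraction
```

Vertical tab (^K) is left off the whitelist here. Add byte 11 to `TEXT_CONTROLS` if your text files do use it.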

      Abigail