Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to test a block of bytes for whether it appears to be "real" text or not. By that I mean, would a human judge it to be text, or "binary"? Conceptually, similar to what the -T/-B operators do, but without the heuristic part: it's either 100% Text, for sure, or else we call it Binary.

I'm pretty sure I could cobble together a regex that does what I want; but I thought there would be a character class (or at most two) which would Do The Right Thing. Unfortunately, there doesn't seem to be. IsAscii is too broad, as it covers a lot of "control" characters that we don't normally think of as being in text, particularly \000. IsPrint is too narrow, as it doesn't even cover <tab>.

Thanks in advance...

Re: How to match "text"?
by BrowserUk (Patriarch) on Jun 14, 2013 at 12:40 UTC

    How about:

    if( $s =~ m[^[ -~\t\n]+$] ) { print 'IsAscii'; } else { print 'IsNotAscii'; }
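A minimal sketch of the same whitelist idea wrapped in a sub, in case it is useful. The name looks_like_text is hypothetical, and the \A ... \z anchors are a choice made here so that the whole string, including its final byte, must be in the class. The accepted set is the same as above: printable ASCII (space through ~), tab, and newline. Add \r to the class if CRLF line endings should count as text.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if every byte is printable ASCII,
# a tab, or a newline. An empty string is treated as "not text".
sub looks_like_text {
    my ($bytes) = @_;
    return $bytes =~ /\A[\x20-\x7E\t\n]+\z/ ? 1 : 0;
}

print looks_like_text("hello\tworld\n") ? "text\n" : "binary\n";
print looks_like_text("abc\0def")       ? "text\n" : "binary\n";
```

The first call reports text and the second reports binary, because a single NUL byte is enough to fail the anchored match.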

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to match "text"? (bytes)
by Anonymous Monk on Jun 14, 2013 at 12:32 UTC

    There are only bytes. To decide how to interpret those bytes, you either have to use heuristics or have someone tell you; there is no other way.

Re: How to match "text"?
by Anonymous Monk on Jun 14, 2013 at 13:11 UTC
    FWIW, I ended up using
    /[[:print:][:space:]]/

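      Without anchors, that succeeds as soon as it finds a single printable or whitespace character, so it just ignores the presence of control characters elsewhere in the string.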

      And be aware that [:space:] includes chr(11) (vertical tab) & chr(12) (form feed) which most modern "text" programs have no idea what to do with.
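Both caveats can be checked with a short script. This is only an illustrative sketch: the $sample string is made up for the demonstration, and the anchored pattern is one possible way to require that every character be in the class.

```perl
use strict;
use warnings;

my $sample = "abc\0def";    # made-up test data containing a NUL byte

# Unanchored: succeeds as soon as one printable character is found.
print "unanchored: matches\n"
    if $sample =~ /[[:print:][:space:]]/;

# Anchored: every character must be in the class, so the NUL makes it fail.
print "anchored: fails\n"
    unless $sample =~ /\A[[:print:][:space:]]+\z/;

# [:space:] accepts vertical tab (chr 11) and form feed (chr 12).
for my $code (11, 12) {
    printf "chr(%d) is in [:space:]\n", $code
        if chr($code) =~ /\A[[:space:]]\z/;
}
```

If vertical tab and form feed should not count as text, an explicit class such as [\x20-\x7E\t\n\r] avoids them.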


        Oh, yeah. I actually did negation, i.e.
        print "Is Non-Text" if /[^[:print:][:space:]]/;