
Sniffing binary data, heuristics?

by Ryszard (Priest)
on May 25, 2004 at 18:53 UTC

Ryszard has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a packet sniffer that will allow me to examine the payload(s) of a TCP stream.

Of course, being TCP, there are many different things that can go on during a snoop, so what I'd like to do is find a nice, fast way to identify the payload type of a TCP packet.

Specifically, I would like to be able to determine whether the payload is binary, so I can re-assemble the packets later on for reconstruction.

Ideally I'd like to be able to identify the payload as binary at run time, although doing it offline is also an option.

I've been thinking about grabbing the first "bunch" of characters and determining whether they're within the printable range; however, this method is not foolproof. Doing a more extensive analysis using diphthongs/whitespace/vowels etc. would, I think, be too slow.

I've not written any code (as yet); I'm just looking for suggestions...

Replies are listed 'Best First'.
Re: Sniffing binary data, heuristics?
by diotalevi (Canon) on May 25, 2004 at 19:16 UTC
    I'd try /[^:[print]:]/ ? 'non-printable' : 'printable' and maybe not even go farther if the performance is ok.
      This interests me as well. Could you elaborate on what /[^:[print]:]/ does?
        That's a typo :). What diotalevi meant to write was /[^[:print:]]/ and it means (as per `perldoc perlre'):
        C:\>perl -MYAPE::Regex::Explain -le"print YAPE::Regex::Explain->new(qr/[^[:print:]]/)->explain"
        The regular expression:

        (?-imsx:[^[:print:]])

        matches as follows:

        NODE                     EXPLANATION
        ----------------------------------------------------------------------
        (?-imsx:                 group, but do not capture (case-sensitive)
                                 (with ^ and $ matching normally) (with . not
                                 matching \n) (matching whitespace and #
                                 normally):
        ----------------------------------------------------------------------
          [^[:print:]]             any character except: alphanumeric,
                                   punctuation, and whitespace characters
        ----------------------------------------------------------------------
        )                        end of grouping
        ----------------------------------------------------------------------
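
A minimal sketch of how the corrected pattern might be applied to the first "bunch" of bytes of a payload. The helper name `classify_prefix` and the 64-byte default are illustrative assumptions, not something from the thread. One caveat worth knowing: in Perl, [:print:] covers the space character but not tab, CR or LF, so they are allowed explicitly here; the /a flag (Perl 5.14 or later) keeps the class ASCII-only.

```perl
use strict;
use warnings;

# Hypothetical helper: inspect only the first $limit bytes of a payload.
# Returns 'non-printable' if any of them is outside printable ASCII
# plus tab, CR and LF; otherwise 'printable'.
sub classify_prefix {
    my ($payload, $limit) = @_;
    $limit = 64 unless defined $limit;    # assumed default, tune to taste
    my $head = substr($payload, 0, $limit);
    return $head =~ /[^[:print:]\t\r\n]/a ? 'non-printable' : 'printable';
}

print classify_prefix("GET / HTTP/1.0\r\n\r\n"), "\n";
print classify_prefix("\x89PNG\r\n\x1a\n\x00\x00"), "\n";
```

Without the explicit \t\r\n, an ordinary HTTP request would be flagged as non-printable because of its line endings.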


Re: Sniffing binary data, heuristics?
by graff (Chancellor) on May 26, 2004 at 06:10 UTC
    I've been thinking about grabbing the first "bunch" of characters and determining whether they're within the printable range; however, this method is not foolproof.

    heh... Printable in what language(s), using what character encoding(s)?

    Does uuencoded or base64 encoded data count as "printable", or as "binary"?

    I'll confess to being clueless about the details of TCP, and ask a presumably stupid question: how would the re-assembly of packets need to differ, based on whether or not you consider the content to be "binary"?

    Doing a more extensive analysis using diphthongs/whitespace/vowels etc. would, I think, be too slow.

    If you're talking about deciding whether or not the content would qualify as "human-readable text", again, I'm hindered by ignorance of TCP (what's the typical packet size?) -- and I'd have to repeat the earlier questions (which human language(s)? which character encoding(s)?) -- but modeling readable text, in terms of the relative probabilities of occurrence for individual characters, would not be very hard, and could be quite robust with test strings of as few as 32 characters (the more, the better, of course).

    Essentially, you "train" one model, consisting of the probabilities for each printable character, on some suitably large set of known human-readable text (just 10K words would probably do); then train another model on a (preferably larger) set of data known to contain little or no readable text (or maybe just assume equal probabilities for all printable byte values).

    For a given stream of input data to be classified, if it contains non-printables, it's probably not text and you're probably done; but if it contains only printable characters (e.g. could be base64 encoded), compute the relative proportions of occurrence over the set of printable characters, and measure the error between these proportions and each of the two models. If the error relative to the human-readable model is significantly lower, the input is human readable. (Unless of course it's spam, which is often tailored to match the unigram character statistics of a language, without regard to readability...)
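
The two-model comparison described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the one-sentence lowercase "training" text (far too small for real use), the flat model standing in for non-text, and the choice of summed squared differences as the error measure. It also uses a plain "lower error wins" test rather than requiring the error to be significantly lower.

```perl
use strict;
use warnings;

# Relative frequency of each character in a string, as a hash ref.
sub char_freq {
    my ($text) = @_;
    my $total = length $text;
    return {} unless $total;
    my %count;
    $count{$_}++ for split //, $text;
    my %freq = map { ( $_ => $count{$_} / $total ) } keys %count;
    return \%freq;
}

# Sum of squared differences between observed and model proportions
# over the 95 printable ASCII characters.
sub model_error {
    my ($observed, $model) = @_;
    my $err = 0;
    for my $code (0x20 .. 0x7E) {
        my $ch   = chr $code;
        my $diff = ($observed->{$ch} || 0) - ($model->{$ch} || 0);
        $err += $diff * $diff;
    }
    return $err;
}

# Toy "human-readable" model (assumption): train on real text instead.
my $text_model = char_freq(
    "the quick brown fox jumps over the lazy dog and then sits in the sun "
);

# "Non-text" model: equal probability for every printable character.
my %flat_model = map { ( chr($_) => 1 / 95 ) } 0x20 .. 0x7E;

sub looks_like_text {
    my ($sample) = @_;
    return 0 if $sample =~ /[^\x20-\x7E]/;    # non-printables: not text
    my $obs = char_freq($sample);
    return model_error($obs, $text_model) < model_error($obs, \%flat_model)
        ? 1 : 0;
}

print looks_like_text("please send the report to the office in the morning"), "\n";
print looks_like_text("U2FsdGVkX1+Qm9Zk3xY7pLr0TgHbNcVw8uJ4KdA6EeM="), "\n";
```

Because the toy model is trained on lowercase English only, it will misjudge text in other languages, encodings or letter case, which is the homogeneity problem raised in the update below.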

    If you are worried about speed, though, you'd be better off doing it in C rather than Perl.

    (Update: in case the question about character encoding didn't make this clear: the modeling of human-readable text would need to be limited to a training set that was homogeneous, at least with respect to character encoding. If your "human" training data includes a mix of UTF-16, UTF-8, GB2312, Big5, Shift JIS, etc., it's going to end up not that different from the "binary" model. And if we're talking about any flavor of Unicode, you also need to limit yourself to a given language (or group of closely related languages); for one thing, the definition of "what is printable" varies widely...)

Re: Sniffing binary data, heuristics?
by NetWallah (Canon) on May 25, 2004 at 20:03 UTC
    How about using the tr operator to count non-printables? Something like:
    if ( tr/a-zA-Z0-9!@#$%^&*()_+\-={}[]|\\,.\/<>?:";'//c > 0 ) {
        # Has non-printables....
    }
    Note: Concept tested, and works; exact tr statement not tested.
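
A small sketch of the same counting idea, turned into a ratio so that a stray odd byte does not decide the result on its own. The helper name `nonprintable_ratio`, the 512-byte window, and the use of the printable ASCII range plus tab, CR and LF (rather than the hand-typed character list) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of the first $limit bytes that are
# neither printable ASCII (0x20-0x7E) nor tab, CR or LF.
sub nonprintable_ratio {
    my ($payload, $limit) = @_;
    $limit = 512 unless defined $limit;    # assumed window size
    my $head = substr($payload, 0, $limit);
    return 0 unless length $head;
    # With /c and an empty replacement list, tr only counts the
    # characters outside the search list; $head is left unchanged.
    my $bad = ($head =~ tr/\x20-\x7E\t\r\n//c);
    return $bad / length($head);
}

printf "%.2f\n", nonprintable_ratio("GET /index.html HTTP/1.0\r\n");
printf "%.2f\n", nonprintable_ratio("ab\x00\x01");
```

A caller could then treat a payload as binary when the ratio exceeds some threshold; the right cut-off depends on the traffic and would have to be found by experiment.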

Re: Sniffing binary data, heuristics?
by DrHyde (Prior) on May 26, 2004 at 07:38 UTC
    I'm writing a packet sniffer

    Why? There are decent sniffers already written. You're better off, IMO, using one of them and then using perl to grovel over their log files rather than spending valuable time writing the sniffer as well.

      Quite simply, because I can. It's an interesting thing for me to do. :-)
        Take a look at Net::Pcap and Net::Packet as these may make your life a lot easier.

        I made a start on this but the semantics of separating TCP/IP streams between multiple clients and multiple servers in real time got very complicated.
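
To give a feel for the bookkeeping involved, here is a deliberately naive sketch of separating streams and putting payloads back in order. It assumes each TCP segment has already been decoded into a hash with src, sport, dst, dport, seq and data fields (a hypothetical structure, not the output format of any particular module), and it ignores everything that makes the real problem hard: sequence-number wraparound, overlapping segments, gaps, and connection setup and teardown.

```perl
use strict;
use warnings;

my %streams;    # stream key => { sequence number => payload }

# One key per direction of a connection: source and destination
# address and port joined together.
sub stream_key {
    my ($seg) = @_;
    return join '|', @{$seg}{qw(src sport dst dport)};
}

# Store a decoded segment (hypothetical hash layout, see above).
sub add_segment {
    my ($seg) = @_;
    return unless defined $seg->{data} && length $seg->{data};   # skip pure ACKs
    # A retransmission with the same sequence number simply overwrites.
    $streams{ stream_key($seg) }{ $seg->{seq} } = $seg->{data};
}

# Concatenate the payloads of one stream in sequence-number order.
sub reassemble {
    my ($key) = @_;
    my $segs = $streams{$key} or return '';
    return join '', map { $segs->{$_} } sort { $a <=> $b } keys %$segs;
}
```

The reassembled string for each key can then be handed to whichever printable-versus-binary check is in use, instead of judging individual packets.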

        I can probably dig out some ugly code if you want :)

        The 'Cat
