vaxgeek has asked for the wisdom of the Perl Monks concerning the following question:

Has anyone ever looked into classifying syslog events with a Bayesian filter/text classifier? This would help you discard events that happen every day, e.g. cron reports, and more readily identify anomalous events.
  • Comment on Syslog event classification with Bayesian style filters

Replies are listed 'Best First'.
Re: Syslog event classification with Bayesian style filters
by tachyon (Chancellor) on Mar 14, 2004 at 23:39 UTC

    We have done a lot of work with Bayes and then Fisher/Robinson, which is a similar but significantly different algorithm. The whole thing revolves around having two sets of data, IN_CLASS and NOT_IN_CLASS for want of better terms. Spam is the simplest example, as you can divide email into GOOD and SPAM, and that is all you want. By analysing token frequency in an UNKNOWN stream you can make a statistically valid prediction about whether 1) the UNKNOWN stream is statistically similar to the CLASS or 2) it is not. So you have a discriminator between two classes, but that is all it really is. You can extend this to multiple classes, but it is still just a "you look, smell and taste like a ..... or you don't" kind of thing.

    With a single syslog as data you have a major issue: you don't have a CLASS/NOT_CLASS problem. The problem is fundamentally different. What you want to do is examine a STREAM and compare it to past STREAM data, looking for stuff that has not (commonly) been seen before.

    Let's consider a syslog that contains lines of data. Say we have several lines like:

    Failed login by root from 1.2.3.4
    Successful login by root from 1.2.3.4
    Failed login by root from 2.3.4.5
    Failed login by root from 2.3.4.5
    Failed login by root from 2.3.4.5
    Failed login by root from 2.3.4.5

    Now we know at a glance what is interesting in this. The failed login from 1.2.3.4 was the real root mistyping their password, but the login attempts from 2.3.4.5 are hack attempts. Herein lie lots of problems. If our training data set contains lots of failed logins from 2.3.4.5, a Bayes-type system will 'learn' that as *normal*, so in future that will get ignored. If, on the other hand, root hardly ever makes a mistake on login, that event will be uncommon and thus likely to fall below the threshold of normal events and get (uselessly) flagged. Also, typical Bayes will tokenise, and there is only about a 15% difference between a failed login from one IP and another failed login in the example given above. Additionally there is multi-line context: a successful login following a failed attempt is likely an honest mistake, while multiple failures in sequence (not necessarily even on sequential lines) are a concern.
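    To put a rough number on the tokenisation point (a toy sketch using only the example lines above): split two of the failed-login lines into word tokens and count how few actually differ.

    ```perl
    use strict;
    use warnings;

    # The two example lines: identical except for the source IP.
    my $real  = 'Failed login by root from 1.2.3.4';
    my $crack = 'Failed login by root from 2.3.4.5';

    my @ta = split ' ', $real;
    my @tb = split ' ', $crack;

    # Count positions where the tokens differ.
    my $diff = grep { $ta[$_] ne $tb[$_] } 0 .. $#ta;

    printf "%d of %d tokens differ\n", $diff, scalar @ta;
    # prints "1 of 6 tokens differ"
    ```

    One differing token out of six is nowhere near enough for a per-line token classifier to separate the honest mistake from the attack.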

    Sure, that is a simplistic, contrived example, but it does illustrate some of the issues.

    I think the approach most likely to give the best results is a rules-based skip approach using regexes. Initially your filter has no regexes; you add ones that let you skip all the common events, leaving the *unknown events* flagged. This general idea is of course already implemented by logwatch.
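    A minimal sketch of the skip-list idea (the patterns and sample lines are illustrative only, not logwatch's actual rules): anything matching a known-boring regex is dropped, everything else gets flagged for a human.

    ```perl
    use strict;
    use warnings;

    # Illustrative skip rules only -- you grow this list over time.
    my @skip = (
        qr/^Successful login by \w+ from [\d.]+$/,   # routine logins
        qr/CRON\[\d+\]/,                             # cron job noise
        qr/syslogd.*restart/,                        # daemon restarts
    );

    # Returns true if the line matches any known-boring rule.
    sub is_boring {
        my ($line) = @_;
        return scalar grep { $line =~ $_ } @skip;
    }

    my @log = (
        'Successful login by root from 1.2.3.4',
        'Failed login by root from 2.3.4.5',
    );

    my @unknown = grep { !is_boring($_) } @log;
    print "UNKNOWN: $_\n" for @unknown;
    # prints "UNKNOWN: Failed login by root from 2.3.4.5"
    ```

    The nice property is that the filter fails safe: an event you have never written a rule for is flagged by default rather than silently absorbed into a statistical model of "normal".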

    Although I have little doubt you could shoehorn the problem into Bayes space I don't think it is a good fit given what you want and how Bayes works.

    cheers

    tachyon

Re: Syslog event classification with Bayesian style filters
by kvale (Monsignor) on Mar 14, 2004 at 21:41 UTC
    I think that one could certainly make this work. But pre-existing Bayesian filters are almost solely specialized for spam, so there would be quite a bit of manual training of priors needed to reliably distinguish expected from surprising log entries. And because it is probabilistic, there will be a finite chance of false negatives, i.e., interesting events that get reported as boring.

    More generally, program-generated log entries have precise, repeatable formats that make them much easier to detect and parse than natural language emails. So creating a filter using regexps (and perhaps a little parsing) to chuck the boring bits of a log is easy enough and is probably less sysadmin effort than training a new Bayesian system.

    -Mark

      I really don't think this problem fits into Bayes space at all well. See Re: Syslog event classification with Bayesian style filters for more details. I totally agree with you that a regex based skip filter (a la logwatch) is not only the simplest but also likely to be the best approach.

      cheers

      tachyon

Re: Syslog event classification with Bayesian style filters
by matsmats (Monk) on Mar 15, 2004 at 12:10 UTC

    Try googling for "anomaly detection" - it's not as flawed an approach as some of the other posters suggest, and a Bayesian filter would probably be a helpful tool in detecting anomalies in your log file (be it intrusion attempts or error messages from whatever is running on your system).

    You could probably roll your own Bayes implementation for this sort of use that fits your problem better than the existing implementations on CPAN - as kvale says, they are mainly specialized in categorizing spam.
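    A rough sketch of what "rolling your own" might look like (hypothetical and untuned - the training lines and threshold choice are made up): learn per-token frequencies from historical log lines, then score a new line by the average rarity of its tokens. Lines full of never-seen tokens score high and get surfaced.

    ```perl
    use strict;
    use warnings;

    my (%freq, $total);

    # Train: count token frequencies over historical log lines.
    sub train {
        for my $tok (split ' ', shift) {
            $freq{$tok}++;
            $total++;
        }
    }

    # Score: mean negative log-probability of the line's tokens,
    # with add-one smoothing so unseen tokens don't blow up.
    sub surprise {
        my @toks = split ' ', shift;
        my $sum  = 0;
        $sum += -log( (($freq{$_} || 0) + 1) / ($total + 1) ) for @toks;
        return $sum / @toks;
    }

    # Pretend history: lots of routine lines (made-up data).
    train('Successful login by root from 1.2.3.4') for 1 .. 50;

    my $boring = surprise('Successful login by root from 1.2.3.4');
    my $novel  = surprise('kernel: Out of memory: killed process 1234');

    printf "boring=%.2f novel=%.2f\n", $boring, $novel;
    ```

    Note this inherits exactly the weakness tachyon describes: a frequently repeated attack trains itself into "normal".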

    A challenge here could be that you can't necessarily just consider each event in the log as it arrives; you have to look at patterns and trends too to detect hack attempts and the like. It's an interesting idea - please tell us if something useful comes out of it.
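    For the patterns-and-trends point, one simple stateful check (a sketch with a made-up threshold, using tachyon's example lines): count failed logins per source IP, reset on success, and flag any IP that crosses the threshold, however the lines are interleaved.

    ```perl
    use strict;
    use warnings;

    my $THRESHOLD = 3;   # made-up cutoff: 3+ failures from one IP is suspicious
    my %failures;

    my @log = (
        'Failed login by root from 1.2.3.4',
        'Successful login by root from 1.2.3.4',
        'Failed login by root from 2.3.4.5',
        'Failed login by root from 2.3.4.5',
        'Failed login by root from 2.3.4.5',
    );

    for my $line (@log) {
        if ($line =~ /^Failed login by \w+ from ([\d.]+)/) {
            $failures{$1}++;
        }
        elsif ($line =~ /^Successful login by \w+ from ([\d.]+)/) {
            delete $failures{$1};   # a success resets the count: honest mistake
        }
    }

    my @suspects = grep { $failures{$_} >= $THRESHOLD } keys %failures;
    print "possible brute force from $_ ($failures{$_} failures)\n" for @suspects;
    # prints "possible brute force from 2.3.4.5 (3 failures)"
    ```

    This is exactly the kind of multi-line context a per-line tokenising classifier can't see.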

    Mats