We have done a lot of work with Bayes and then Fisher/Robinson, which is similar but a significantly different algorithm. The whole thing revolves around having two sets of data, IN_CLASS and NOT_IN_CLASS for want of better terms. Spam is the simplest example: you divide email into GOOD and SPAM and that is all you need. By analysing token frequency in an UNKNOWN stream you can make a statistically valid prediction about whether 1) the UNKNOWN stream is statistically similar to the CLASS or 2) it is not. So you have a discriminator between two classes, but that is all it really is. You can extend this to multiple classes, but it is still just a "you look, smell and taste like a ..... or you don't" kind of thing.
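To make that concrete, here is a minimal sketch of the two-class, token-frequency idea in Python. It is plain naive-Bayes-style log ratios, not the full Fisher/Robinson chi-squared combining, and the toy training strings are invented:

from collections import Counter
import math

def train(messages):
    # Count token occurrences across a list of example messages.
    counts = Counter()
    for msg in messages:
        counts.update(msg.lower().split())
    return counts

def score(unknown, in_class, not_in_class):
    # Sum of log-ratios: > 0 means the unknown text looks like IN_CLASS,
    # < 0 means it looks like NOT_IN_CLASS.
    total_in = sum(in_class.values()) + 1
    total_out = sum(not_in_class.values()) + 1
    log_ratio = 0.0
    for tok in unknown.lower().split():
        p_in = (in_class[tok] + 1) / total_in     # +1 smoothing for unseen tokens
        p_out = (not_in_class[tok] + 1) / total_out
        log_ratio += math.log(p_in / p_out)
    return log_ratio

spam = train(["buy cheap pills now", "cheap pills cheap cheap"])
good = train(["meeting moved to noon", "see you at lunch tomorrow"])
print(score("cheap pills for you", spam, good))   # positive: smells like SPAM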
With a single syslog as data you have a major issue: you don't have a CLASS/NOT_CLASS problem. The problem is fundamentally different. What you want to do is examine a STREAM and compare it to past STREAM data, looking for stuff that has not (commonly) been seen before.
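As a back-of-envelope illustration of that framing (the file names, the digit-masking regex and the rarity threshold are all made up for the sketch):

import re
from collections import Counter

def shape(line):
    # Reduce a log line to a rough template by masking numbers and IP addresses.
    return re.sub(r"\d+(\.\d+)*", "N", line.strip())

def unseen(new_lines, old_lines, threshold=3):
    # Yield lines whose shape occurred fewer than `threshold` times in the history.
    history = Counter(shape(l) for l in old_lines)
    for line in new_lines:
        if history[shape(line)] < threshold:
            yield line

# e.g. for line in unseen(open("today.log"), open("yesterday.log")): print(line, end="")

Note that a naive frequency check like this runs straight into the problems described below.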
Let's consider a syslog that contains lines of data. Say we have several lines like:
Failed login by root from 1.2.3.4
Successful login by root from 1.2.3.4
Failed login by root from 2.3.4.5
Failed login by root from 2.3.4.5
Failed login by root from 2.3.4.5
Failed login by root from 2.3.4.5
Now we know at a glance what is interesting here. The failed login from 1.2.3.4 was the real root mistyping their password, but the login attempts from 2.3.4.5 are hack attempts. Herein lie lots of problems. If our training data set contains lots of failed logins from 2.3.4.5, a Bayes-type system will 'learn' that as *normal*, so in future it will get ignored. If on the other hand root hardly ever makes a mistake on login, that event will be uncommon, will likely fall below the threshold of normal events, and will get (uselessly) flagged. Also, typical Bayes will tokenise, and there is only about a 15% difference between a failed login from one IP and a failed login from the other in the example above. Additionally there is multi-line context: a successful login following a failed attempt is likely an honest mistake, while multiple failures in sequence, not necessarily even on consecutive lines, are a concern.
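For what it's worth, the token-level similarity point is easy to check (a throwaway illustration, nothing more):

a = "Failed login by root from 1.2.3.4".split()
b = "Failed login by root from 2.3.4.5".split()
print(sum(x != y for x, y in zip(a, b)), "of", len(a), "tokens differ")   # 1 of 6: only the IP changes

So to a tokenising classifier the mistyped password and the hack attempt are almost the same event.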
Sure, that is a fairly contrived, simplistic example, but it does illustrate some of the issues.
I think the approach that is most likely to give the best results is a rules-based skip approach using regexes. Initially your filter has no regexes. You add ones that let you skip all the common events, leaving the *unknown events* flagged. This general idea is of course already implemented by logwatch.
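A minimal sketch of that skip-list idea, assuming the log arrives on stdin (the two patterns are invented placeholders, not anything logwatch actually ships with):

import re
import sys

SKIP_PATTERNS = [
    re.compile(r" CRON\[\d+\]: "),                         # routine cron noise
    re.compile(r"session (opened|closed) for user \w+"),   # normal PAM sessions
]

def unknown_lines(stream):
    # Pass through only the lines that no skip pattern matches.
    for line in stream:
        if not any(p.search(line) for p in SKIP_PATTERNS):
            yield line

if __name__ == "__main__":
    for line in unknown_lines(sys.stdin):
        sys.stdout.write(line)

You run it over yesterday's log, decide which of the flagged lines are actually routine, add a regex for each, and repeat; over time only the genuinely unusual lines survive.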
Although I have little doubt you could shoehorn the problem into Bayes space, I don't think it is a good fit given what you want and how Bayes works.