Markov Chain automata class

The Following code provides a class to create Markov chain based automata which role is to tell you, after reading a dictionary, whether words you give them are likely to be part of the language they studied.

I wrote it so I can use it to comb through dns queries logs to find anything that looks like an algorithmicaly generated domain name. have fun :D

package Security::Monitoring::Detection::Markov;

use 5.006;
use strict;
use warnings;
use Storable qw(nstore retrieve);
use Carp;
use Security::Monitoring::Utils;


our $dict_file = '/usr/share/dict/words';
=head1 NAME

Markov chain based automata class

=head1 VERSION

Version 0.01

=cut

our $VERSION = '0.01';


=head1 SYNOPSIS
This module allows you to summon Markov chain based automata
That, after being given a dictionary to learn from will
give marks to words based on how likely these words are to 
be part of the language used in the dictionary they learnt.



=head1 SUBROUTINES/METHODS

=head2 new
Summoning method.
give it a number (it will be a percentage) to determine
when the automaton will trigger an alert. This percentage is the devia
+tion
score from the mean of all learnt words.

eg 10% with a mean score of all words learnt of 1 will make the automa
+ton
trigger an alert if after examining a word it obtains a score that is 
+greate
or lower than his mean by at least 10%


=head3 use

my $automaton = $class->new($alert_level);
=cut

sub new {
    my $class = shift;
        my $alert_level = shift;
        my $self = {};
    bless $self, $class;
        $self->_init($alert_level);
    $self;
}


=head2 _init

separate init function to make inheritance easier
=cut
sub _init{
    my $self = shift;
    my $alert_level = shift;
    if( !defined($alert_level) || !($alert_level =~ /\d+/)){
        croak "alert leve undefined!\n";
    }
    my @keys = qw(t_number word_count mean_score t_table_ref alert_lev
+el
    false_positives);
    my %t_table;#hash that will later contain the transitions
    my %false_positives;#hash that will contian the false positives
    foreach my $key (@keys){
        $self->{$key} = 0;
    }
    $self->{t_table_ref} = \%t_table;
    $self->{false_positives} = \%false_positives;
}

=head2 learn

this method is used by an automaton to look through a dictionary in th
+e form of
multiple lines. beware that telling an already educated automaton to l
+earn will
change whatever results you obtain

this function will take a filehandle to the dictionary file (one word 
+per line)
and will proceed to populate the automaton t_table with it.

each object has a transition table wich is in face made of a number of
+ hash
table : first level one has one key for each element of the alphabet a
+nd another
one for the empty word, each value is a reference to another hashtable
containing as keys the letter to which the transition is mad and as va
+lue the
number of transitions made to that key

the last parameter is a boolean, leave it empty to indicate its a file
+ and not
just a string.

=head3 use
$automaton->learn($file_name[,$is_string]);
if you do not provide the $is_string parameter then it will behave as 
+if
reading from a file

=cut

sub learn {
    my ($self,$file_name,$is_string) = @_;
    my $fh;
    my $hash = $self->{t_table_ref};
    my $words_learnt = 0;

        #chooses whether we ar elearning from a string or a file
    if(!defined($is_string)){
                if (!defined($file_name)){
                    croak("using an undefined filename\n");
                }
        open $fh, '<',$file_name or croak qq(could not open dictionary
+ file
        $file_name :$!\n);
    }
    else{
                if (!defined $file_name){
                    croak "sorry, that string is undefined, can not re
+ad from
                    it\n";
                }
                elsif ($file_name eq ''){
                    croak "sorry, that string is empty, nothing to rea
+d!\n"
                }
        open $fh, '<',\$file_name;
    }
        #reads each word and update the transitiontable with it
    while(<$fh>){
        my $word = $_;
        chomp $word;
        $word =~ s/(\w)/lc($1)/ge;#put it in full lowercase
                $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alph
+anumeric
                                                #symbols
        my @letters = split '',$word;
        my $last_letter = 'empty';#first iteration starts from the emp
+ty
                #letter
        for(my $i = 0; $i <= $#letters;$i++){
            my $cur_letter = $letters[$i];

                        #if there is no subkey in the transition table
+ then we start
                        #with one which value is set at 1
            $hash->{$last_letter}->{$cur_letter}  
                        = defined $hash->{$last_letter}->{$cur_letter}
+ 
                        ? $hash->{$last_letter}->{$cur_letter}++      
+:        1; 

            $self->{t_number}++;
            $last_letter = $cur_letter;
        }
        $words_learnt++;
    }
        my $total = 0;
        seek $fh,0,0;
        while (<$fh>){
            my $word = $_;
            chomp $word;
            $word =~ s/(\w)/lc($1)/ge;#put it in full lowercase
            $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanum
+eric
            $total += $self->get_score($word);
        }
            $self->{mean_score} =
                (($self->{mean_score} * $self->{word_count})
                +($total * $words_learnt))
                / ($self->{word_count} + $words_learnt);
    
            $self->{word_count}+=$words_learnt;

    close $fh;
    return $words_learnt;
}

=head2 _transitions
getter for t_number field
=cut
sub _transitions{
    my $self = shift;
    $self->{t_number};
}

=head2 get_score
This method is used once the automaton has learnt enough words (enough
+ being
left to the judgement of the user, dictionary files are a good source 
+I reckon)
using it it will give you a score that will determine wether that lett
+er
ordering is unusual (the higher the score the most unusual it will be)
+. The
score should be around 0 for known words, normal transitions such as c
+onsumn to
vowels should not add too much to the score. strange ones( such as
xxxbxczbbwdx)will do. Keep in mind that a completely unknown transitio
+n
will give one point, anything else will give a score calculated using 
+the
following formula : 

1 - (number of time this transition already occured / total of transit
+ions
recorded);


=head3 use

$automaton->get_score($word);
=cut

sub get_score{
    my ($self,$word) = @_;
    my $score = 0;
    chomp $word;
    $word =~ s/(\w)/lc($1)/ge;
        $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanumeric
                                        #symbols

        if(defined $self->{false_positives}->{$word}){
            return $self->{mean_score};
        }

    my @letters = split('',$word);
    my $last_letter = "empty";

    foreach my $l (@letters){
        $score 
                += defined $self->{t_table_ref}->{$last_letter}->{$l}?
            $self->{t_table_ref}->{$last_letter}->{$l} / $self->{t_num
+ber}:    1;
        $last_letter = $l;
    }    
    return $score;
}


=head2 flag_as_false_positive
    this function allows you to flag a specific word as a false positi
+ve
    afterward whenever the autmaton tries to calculate the score of th
+e word
    it will return his mean score to prevent triggering an alert

=cut

sub flag_as_false_positive{
    my($self,$word) = @_;
    $word =~ s/(\w)/lc($1)/ge;
    $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanumeric
                                    #symbols
    $self->{false_positives}->{$word} = 1;
}

=head2 get_false_positives
    returns a reference to a list of flagged false positives from the 
+automata
=cut

sub get_false_positives{
    my $self = shift;
    my @false_positives = keys %{$self->{false_positives}};
    return \@false_positives;
}
=head2 sys_learn

this subroutines uses the learn function on the system's /usr/share/di
+ct/words
file if it exists. if it does not it will return 0. If it does it will
+ return
the number of learnt words

if the file is empty it will return -1;

=head3 use
    $automaton->sys_learn;
=cut

sub sys_learn{
    my $self =  shift;
    if (!ref $self){
        croak("can not syslearn as class!\n");
    }
    my $file_name = $dict_file;
    if (-e $file_name){
        my $learnt = $self->learn($file_name);
        return $learnt;
    }
    else{
       croak "bad package var dict_file\n";
    }
}

1; # End of Markov
[download]

And here is the test file : 100% testcover :)

#!perl -T
use 5.006;
use strict;
use warnings;
use Test::More;
use Test::Exception;
use diagnostics;
use Security::Monitoring::Detection::Markov;
use Security::Monitoring::Utils;


BEGIN {
    plan tests => 25;
    use_ok( 'Security::Monitoring::Detection::Markov' ) || print "Bail
+ out!\n";
}
our $dict_file;
my $class='Security::Monitoring::Detection::Markov';
#setup and test function existence
diag( "Testing Markov module
$Security::Monitoring::Detection::Markov::VERSION, Perl $], $^X" );
ok(defined(Security::Monitoring::Detection::Markov->new(10)),"automata
+ generator
working");
my $automaton = Security::Monitoring::Detection::Markov->new(10);
ok(ref $automaton,"generator correctly returned a ref to the new autom
+aton");
my $learnstring = "test_dict";
open my $fh, '>',$learnstring; 
print $fh "dd\nff";
close $fh;

dies_ok(sub {my $new = $class->new(undef)},"new dies with undef alert_
+level");
dies_ok(sub {my $new = $class->new('trololol')},"new dies with non int
alert_level");
#tests learning
dies_ok(sub {$automaton->learn('trololol');},"learn dies with incorrec
+t filename");
dies_ok(sub {$automaton->learn(undef);},"learn dies with undef filenam
+e");
dies_ok(sub{$automaton->learn(undef,1)},"learn dies with undef string"
+);
dies_ok(sub {$automaton->learn('',1)},"learn dies with empty string");

#undefines all env variables to prevent taint warning
flush_env;
`/usr/bin/touch emptyfile`;
dies_ok(sub{$automaton->learn('emptyfile')},"learn dies with empty fil
+e");
is($automaton->learn($learnstring),2,"automaton learnt 2 words from fi
+le");
unlink $learnstring;
is($automaton->_transitions,4,"automaton has 4 transitions");
is($automaton->get_score("mama"),4,"automaton scored 4 on a four new l
+etters word");
is($automaton->get_score("dd"),0.5,"automaton scored 0.5 on one of the
+ two
learnt words");
is($automaton->learn("tata\ntutu",1),2,"automaton learnt two words fro
+m string");
my $emptymaton = Security::Monitoring::Detection::Markov->new(10);
$ENV{PATH} = '/usr/bin';
my $dict_words_number = `wc -l /usr/share/dict/words`;
chomp $dict_words_number;
$dict_words_number =~ s/[\D]//g;

dies_ok(sub{$class->sys_learn;},"sys learn dies when called as class m
+ethod");
my $tmp = $Security::Monitoring::Detection::Markov::dict_file;
dies_ok(sub{
            $Security::Monitoring::Detection::Markov::dict_file = 'tro
+lolol';
            my $auto = $class->new;
            $auto->sys_learn;
        },"sys learn dies with dummy dict file name");
dies_ok(sub{
            $Security::Monitoring::Detection::Markov::dict_file = 'emp
+tyfile';
            my $auto = $class->new;
            $auto->sys_learn;
        },"sys learn dies with empty dict file ");


unlink 'emptyfile';
$Security::Monitoring::Detection::Markov::dict_file= $tmp;
is($emptymaton->sys_learn,$dict_words_number,"sys_learn works correctl
+y");
is($emptymaton->{word_count},$dict_words_number,"automaton correctly r
+eports
number of words learnt");

#tests save and load
ok(($automaton->save("automaton.at") 
    =~ /automaton\.at/) 
    && (-e "automaton.at"),"automaton correctly saved to disk");
my $newtomaton = $class->load("automaton.at");
ok(defined($newtomaton),"automaton reloaded");
unlink "automaton.at";
$newtomaton->{empty}->{a}+=1;
my $autsub = $class->new(10);
my $autsub2 = $class->new(10);
my $hash1 = {a=>1};
my $hash2 = {b=>2};
$autsub->{t_table}->{x} = $hash1;
$autsub2->{t_table}->{x} = $hash2;

#tests false positive handling
my $word = 'xxxxxxxmmmmmnnnxyxyxyxynmnmnmn';
ok($newtomaton->flag_as_false_positive($word),"false positive flagging
+ works");
is(${$newtomaton->get_false_positives}[0],$word,"get_false_positive wo
+rks");
is($newtomaton->get_score($word),$newtomaton->{mean_score},"false posi
+tive
evaluation returns correct value");
[download]

Comment on Markov Chain automata class Select or Download Code

Replies are listed 'Best First'.
Re: Markov Chain automata class by VinsWorldcom (Prior) on Aug 24, 2015 at 18:41 UTC
First glaring error - 'use Security::Monitoring::Utils;' - that isn't included so I can't run it. When I comment that out, it seems to run fine. I'd give your POD a review with 'podcheker'. Run 'podchecker -w -w filename'; I get a lot of errors running your POD through it. Once that's corrected, adding to / organizing your POD so it's helpful to a first time user would be great. I was able to deduce from trial and error the basics I needed to get something running (see below), but a SYNOPSIS section usually includes this. What's in your SYNOPSIS section is usually put under a DESCRIPTION section. FYI, I ran this on Windows x64 with Strawberry 5.18.1 (64-bit) and it ran fine with my \usr\ install from GnuWin32 (which included /usr/share/dict/words'. `my $sec = Security::Monitoring::Detection::Markov->new(10); $sec->sys_learn(); print "finger = " . $sec->get_score('finger') . "\n"; print "alkajsdfa = " . $sec->get_score('alkajsdfa') . "\n"; print "xxxphotoblatxxx = " . $sec->get_score('xxxphotoblatxxx') . "\n" +;` [download] Output: `finger = 1.89777628064316e-006 alkajsdfa = 2.84666442096473e-006 xxxphotoblatxxx = 4.74444070160789e-006` [download]	[reply] [d/l] [select]
Re^2: Markov Chain automata class by QuillMeantTen (Friar) on Aug 24, 2015 at 18:47 UTC
Thank you for your input, I'm sorry about the Utils module, its just a wrapper around nstore and retrieve so I can call it from everywhere in my code I'll run it through podchecker tomorrow and correct those silly mistakes I'll also reorganise the documentation to make the synopsis part really useful I'm most interested in your feedback about the scores, I have no idea what kind of threshold should be used for alerts, whether 10% is or ain't enough. I was thinking about studying the distribution of scores for the words of the dictionary around the mean (hopefuly looking for anything that could tell me it follow a normal distribution) but I'm not sure whether its the way to go	[reply]
Re^3: Markov Chain automata class by VinsWorldcom (Prior) on Aug 24, 2015 at 18:51 UTC
Other than a cursory knowledge of the term, I needed Wikipedia to help with Markov chaining and the likes you talk about. I'm afraid I can't be of any help on the scores output, simply the overall architecture and user-friendliness of the module.	[reply]
Re^4: Markov Chain automata class by QuillMeantTen (Friar) on Aug 25, 2015 at 07:39 UTC