The Following code provides a class to create Markov chain based automata which role is to tell you, after reading a dictionary, whether words you give them are likely to be part of the language they studied.

I wrote it so I can use it to comb through dns queries logs to find anything that looks like an algorithmicaly generated domain name. have fun :D

package Security::Monitoring::Detection::Markov; use 5.006; use strict; use warnings; use Storable qw(nstore retrieve); use Carp; use Security::Monitoring::Utils; our $dict_file = '/usr/share/dict/words'; =head1 NAME Markov chain based automata class =head1 VERSION Version 0.01 =cut our $VERSION = '0.01'; =head1 SYNOPSIS This module allows you to summon Markov chain based automata That, after being given a dictionary to learn from will give marks to words based on how likely these words are to be part of the language used in the dictionary they learnt. =head1 SUBROUTINES/METHODS =head2 new Summoning method. give it a number (it will be a percentage) to determine when the automaton will trigger an alert. This percentage is the devia +tion score from the mean of all learnt words. eg 10% with a mean score of all words learnt of 1 will make the automa +ton trigger an alert if after examining a word it obtains a score that is +greate or lower than his mean by at least 10% =head3 use my $automaton = $class->new($alert_level); =cut sub new { my $class = shift; my $alert_level = shift; my $self = {}; bless $self, $class; $self->_init($alert_level); $self; } =head2 _init separate init function to make inheritance easier =cut sub _init{ my $self = shift; my $alert_level = shift; if( !defined($alert_level) || !($alert_level =~ /\d+/)){ croak "alert leve undefined!\n"; } my @keys = qw(t_number word_count mean_score t_table_ref alert_lev +el false_positives); my %t_table;#hash that will later contain the transitions my %false_positives;#hash that will contian the false positives foreach my $key (@keys){ $self->{$key} = 0; } $self->{t_table_ref} = \%t_table; $self->{false_positives} = \%false_positives; } =head2 learn this method is used by an automaton to look through a dictionary in th +e form of multiple lines. beware that telling an already educated automaton to l +earn will change whatever results you obtain this function will take a filehandle to the dictionary file (one word +per line) and will proceed to populate the automaton t_table with it. each object has a transition table wich is in face made of a number of + hash table : first level one has one key for each element of the alphabet a +nd another one for the empty word, each value is a reference to another hashtable containing as keys the letter to which the transition is mad and as va +lue the number of transitions made to that key the last parameter is a boolean, leave it empty to indicate its a file + and not just a string. =head3 use $automaton->learn($file_name[,$is_string]); if you do not provide the $is_string parameter then it will behave as +if reading from a file =cut sub learn { my ($self,$file_name,$is_string) = @_; my $fh; my $hash = $self->{t_table_ref}; my $words_learnt = 0; #chooses whether we ar elearning from a string or a file if(!defined($is_string)){ if (!defined($file_name)){ croak("using an undefined filename\n"); } open $fh, '<',$file_name or croak qq(could not open dictionary + file $file_name :$!\n); } else{ if (!defined $file_name){ croak "sorry, that string is undefined, can not re +ad from it\n"; } elsif ($file_name eq ''){ croak "sorry, that string is empty, nothing to rea +d!\n" } open $fh, '<',\$file_name; } #reads each word and update the transitiontable with it while(<$fh>){ my $word = $_; chomp $word; $word =~ s/(\w)/lc($1)/ge;#put it in full lowercase $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alph +anumeric #symbols my @letters = split '',$word; my $last_letter = 'empty';#first iteration starts from the emp +ty #letter for(my $i = 0; $i <= $#letters;$i++){ my $cur_letter = $letters[$i]; #if there is no subkey in the transition table + then we start #with one which value is set at 1 $hash->{$last_letter}->{$cur_letter} = defined $hash->{$last_letter}->{$cur_letter} + ? $hash->{$last_letter}->{$cur_letter}++ +: 1; $self->{t_number}++; $last_letter = $cur_letter; } $words_learnt++; } my $total = 0; seek $fh,0,0; while (<$fh>){ my $word = $_; chomp $word; $word =~ s/(\w)/lc($1)/ge;#put it in full lowercase $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanum +eric $total += $self->get_score($word); } $self->{mean_score} = (($self->{mean_score} * $self->{word_count}) +($total * $words_learnt)) / ($self->{word_count} + $words_learnt); $self->{word_count}+=$words_learnt; close $fh; return $words_learnt; } =head2 _transitions getter for t_number field =cut sub _transitions{ my $self = shift; $self->{t_number}; } =head2 get_score This method is used once the automaton has learnt enough words (enough + being left to the judgement of the user, dictionary files are a good source +I reckon) using it it will give you a score that will determine wether that lett +er ordering is unusual (the higher the score the most unusual it will be) +. The score should be around 0 for known words, normal transitions such as c +onsumn to vowels should not add too much to the score. strange ones( such as xxxbxczbbwdx)will do. Keep in mind that a completely unknown transitio +n will give one point, anything else will give a score calculated using +the following formula : 1 - (number of time this transition already occured / total of transit +ions recorded); =head3 use $automaton->get_score($word); =cut sub get_score{ my ($self,$word) = @_; my $score = 0; chomp $word; $word =~ s/(\w)/lc($1)/ge; $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanumeric #symbols if(defined $self->{false_positives}->{$word}){ return $self->{mean_score}; } my @letters = split('',$word); my $last_letter = "empty"; foreach my $l (@letters){ $score += defined $self->{t_table_ref}->{$last_letter}->{$l}? $self->{t_table_ref}->{$last_letter}->{$l} / $self->{t_num +ber}: 1; $last_letter = $l; } return $score; } =head2 flag_as_false_positive this function allows you to flag a specific word as a false positi +ve afterward whenever the autmaton tries to calculate the score of th +e word it will return his mean score to prevent triggering an alert =cut sub flag_as_false_positive{ my($self,$word) = @_; $word =~ s/(\w)/lc($1)/ge; $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanumeric #symbols $self->{false_positives}->{$word} = 1; } =head2 get_false_positives returns a reference to a list of flagged false positives from the +automata =cut sub get_false_positives{ my $self = shift; my @false_positives = keys %{$self->{false_positives}}; return \@false_positives; } =head2 sys_learn this subroutines uses the learn function on the system's /usr/share/di +ct/words file if it exists. if it does not it will return 0. If it does it will + return the number of learnt words if the file is empty it will return -1; =head3 use $automaton->sys_learn; =cut sub sys_learn{ my $self = shift; if (!ref $self){ croak("can not syslearn as class!\n"); } my $file_name = $dict_file; if (-e $file_name){ my $learnt = $self->learn($file_name); return $learnt; } else{ croak "bad package var dict_file\n"; } } 1; # End of Markov

And here is the test file : 100% testcover :)

#!perl -T use 5.006; use strict; use warnings; use Test::More; use Test::Exception; use diagnostics; use Security::Monitoring::Detection::Markov; use Security::Monitoring::Utils; BEGIN { plan tests => 25; use_ok( 'Security::Monitoring::Detection::Markov' ) || print "Bail + out!\n"; } our $dict_file; my $class='Security::Monitoring::Detection::Markov'; #setup and test function existence diag( "Testing Markov module $Security::Monitoring::Detection::Markov::VERSION, Perl $], $^X" ); ok(defined(Security::Monitoring::Detection::Markov->new(10)),"automata + generator working"); my $automaton = Security::Monitoring::Detection::Markov->new(10); ok(ref $automaton,"generator correctly returned a ref to the new autom +aton"); my $learnstring = "test_dict"; open my $fh, '>',$learnstring; print $fh "dd\nff"; close $fh; dies_ok(sub {my $new = $class->new(undef)},"new dies with undef alert_ +level"); dies_ok(sub {my $new = $class->new('trololol')},"new dies with non int alert_level"); #tests learning dies_ok(sub {$automaton->learn('trololol');},"learn dies with incorrec +t filename"); dies_ok(sub {$automaton->learn(undef);},"learn dies with undef filenam +e"); dies_ok(sub{$automaton->learn(undef,1)},"learn dies with undef string" +); dies_ok(sub {$automaton->learn('',1)},"learn dies with empty string"); #undefines all env variables to prevent taint warning flush_env; `/usr/bin/touch emptyfile`; dies_ok(sub{$automaton->learn('emptyfile')},"learn dies with empty fil +e"); is($automaton->learn($learnstring),2,"automaton learnt 2 words from fi +le"); unlink $learnstring; is($automaton->_transitions,4,"automaton has 4 transitions"); is($automaton->get_score("mama"),4,"automaton scored 4 on a four new l +etters word"); is($automaton->get_score("dd"),0.5,"automaton scored 0.5 on one of the + two learnt words"); is($automaton->learn("tata\ntutu",1),2,"automaton learnt two words fro +m string"); my $emptymaton = Security::Monitoring::Detection::Markov->new(10); $ENV{PATH} = '/usr/bin'; my $dict_words_number = `wc -l /usr/share/dict/words`; chomp $dict_words_number; $dict_words_number =~ s/[\D]//g; dies_ok(sub{$class->sys_learn;},"sys learn dies when called as class m +ethod"); my $tmp = $Security::Monitoring::Detection::Markov::dict_file; dies_ok(sub{ $Security::Monitoring::Detection::Markov::dict_file = 'tro +lolol'; my $auto = $class->new; $auto->sys_learn; },"sys learn dies with dummy dict file name"); dies_ok(sub{ $Security::Monitoring::Detection::Markov::dict_file = 'emp +tyfile'; my $auto = $class->new; $auto->sys_learn; },"sys learn dies with empty dict file "); unlink 'emptyfile'; $Security::Monitoring::Detection::Markov::dict_file= $tmp; is($emptymaton->sys_learn,$dict_words_number,"sys_learn works correctl +y"); is($emptymaton->{word_count},$dict_words_number,"automaton correctly r +eports number of words learnt"); #tests save and load ok(($automaton->save("automaton.at") =~ /automaton\.at/) && (-e "automaton.at"),"automaton correctly saved to disk"); my $newtomaton = $class->load("automaton.at"); ok(defined($newtomaton),"automaton reloaded"); unlink "automaton.at"; $newtomaton->{empty}->{a}+=1; my $autsub = $class->new(10); my $autsub2 = $class->new(10); my $hash1 = {a=>1}; my $hash2 = {b=>2}; $autsub->{t_table}->{x} = $hash1; $autsub2->{t_table}->{x} = $hash2; #tests false positive handling my $word = 'xxxxxxxmmmmmnnnxyxyxyxynmnmnmn'; ok($newtomaton->flag_as_false_positive($word),"false positive flagging + works"); is(${$newtomaton->get_false_positives}[0],$word,"get_false_positive wo +rks"); is($newtomaton->get_score($word),$newtomaton->{mean_score},"false posi +tive evaluation returns correct value");

Replies are listed 'Best First'.
Re: Markov Chain automata class
by VinsWorldcom (Prior) on Aug 24, 2015 at 18:41 UTC
    1. First glaring error - 'use Security::Monitoring::Utils;' - that isn't included so I can't run it. When I comment that out, it seems to run fine.

    2. I'd give your POD a review with 'podcheker'. Run 'podchecker -w -w filename'; I get a lot of errors running your POD through it.

    3. Once that's corrected, adding to / organizing your POD so it's helpful to a first time user would be great. I was able to deduce from trial and error the basics I needed to get something running (see below), but a SYNOPSIS section usually includes this. What's in your SYNOPSIS section is usually put under a DESCRIPTION section.

    FYI, I ran this on Windows x64 with Strawberry 5.18.1 (64-bit) and it ran fine with my \usr\ install from GnuWin32 (which included /usr/share/dict/words'.

    my $sec = Security::Monitoring::Detection::Markov->new(10); $sec->sys_learn(); print "finger = " . $sec->get_score('finger') . "\n"; print "alkajsdfa = " . $sec->get_score('alkajsdfa') . "\n"; print "xxxphotoblatxxx = " . $sec->get_score('xxxphotoblatxxx') . "\n" +;

    Output:

    finger = 1.89777628064316e-006 alkajsdfa = 2.84666442096473e-006 xxxphotoblatxxx = 4.74444070160789e-006

      Thank you for your input, I'm sorry about the Utils module, its just a wrapper around nstore and retrieve so I can call it from everywhere in my code

      I'll run it through podchecker tomorrow and correct those silly mistakes
      I'll also reorganise the documentation to make the synopsis part really useful

      I'm most interested in your feedback about the scores, I have no idea what kind of threshold should be used for alerts, whether 10% is or ain't enough.

      I was thinking about studying the distribution of scores for the words of the dictionary around the mean (hopefuly looking for anything that could tell me it follow a normal distribution) but I'm not sure whether its the way to go

        Other than a cursory knowledge of the term, I needed Wikipedia to help with Markov chaining and the likes you talk about. I'm afraid I can't be of any help on the scores output, simply the overall architecture and user-friendliness of the module.