The Following code provides a class to create Markov chain based automata which role is to tell you, after reading a dictionary, whether words you give them are likely to be part of the language they studied.
I wrote it so I can use it to comb through dns queries logs to find anything that looks like an algorithmicaly generated domain name. have fun :D
package Security::Monitoring::Detection::Markov; use 5.006; use strict; use warnings; use Storable qw(nstore retrieve); use Carp; use Security::Monitoring::Utils; our $dict_file = '/usr/share/dict/words'; =head1 NAME Markov chain based automata class =head1 VERSION Version 0.01 =cut our $VERSION = '0.01'; =head1 SYNOPSIS This module allows you to summon Markov chain based automata That, after being given a dictionary to learn from will give marks to words based on how likely these words are to be part of the language used in the dictionary they learnt. =head1 SUBROUTINES/METHODS =head2 new Summoning method. give it a number (it will be a percentage) to determine when the automaton will trigger an alert. This percentage is the devia +tion score from the mean of all learnt words. eg 10% with a mean score of all words learnt of 1 will make the automa +ton trigger an alert if after examining a word it obtains a score that is +greate or lower than his mean by at least 10% =head3 use my $automaton = $class->new($alert_level); =cut sub new { my $class = shift; my $alert_level = shift; my $self = {}; bless $self, $class; $self->_init($alert_level); $self; } =head2 _init separate init function to make inheritance easier =cut sub _init{ my $self = shift; my $alert_level = shift; if( !defined($alert_level) || !($alert_level =~ /\d+/)){ croak "alert leve undefined!\n"; } my @keys = qw(t_number word_count mean_score t_table_ref alert_lev +el false_positives); my %t_table;#hash that will later contain the transitions my %false_positives;#hash that will contian the false positives foreach my $key (@keys){ $self->{$key} = 0; } $self->{t_table_ref} = \%t_table; $self->{false_positives} = \%false_positives; } =head2 learn this method is used by an automaton to look through a dictionary in th +e form of multiple lines. beware that telling an already educated automaton to l +earn will change whatever results you obtain this function will take a filehandle to the dictionary file (one word +per line) and will proceed to populate the automaton t_table with it. each object has a transition table wich is in face made of a number of + hash table : first level one has one key for each element of the alphabet a +nd another one for the empty word, each value is a reference to another hashtable containing as keys the letter to which the transition is mad and as va +lue the number of transitions made to that key the last parameter is a boolean, leave it empty to indicate its a file + and not just a string. =head3 use $automaton->learn($file_name[,$is_string]); if you do not provide the $is_string parameter then it will behave as +if reading from a file =cut sub learn { my ($self,$file_name,$is_string) = @_; my $fh; my $hash = $self->{t_table_ref}; my $words_learnt = 0; #chooses whether we ar elearning from a string or a file if(!defined($is_string)){ if (!defined($file_name)){ croak("using an undefined filename\n"); } open $fh, '<',$file_name or croak qq(could not open dictionary + file $file_name :$!\n); } else{ if (!defined $file_name){ croak "sorry, that string is undefined, can not re +ad from it\n"; } elsif ($file_name eq ''){ croak "sorry, that string is empty, nothing to rea +d!\n" } open $fh, '<',\$file_name; } #reads each word and update the transitiontable with it while(<$fh>){ my $word = $_; chomp $word; $word =~ s/(\w)/lc($1)/ge;#put it in full lowercase $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alph +anumeric #symbols my @letters = split '',$word; my $last_letter = 'empty';#first iteration starts from the emp +ty #letter for(my $i = 0; $i <= $#letters;$i++){ my $cur_letter = $letters[$i]; #if there is no subkey in the transition table + then we start #with one which value is set at 1 $hash->{$last_letter}->{$cur_letter} = defined $hash->{$last_letter}->{$cur_letter} + ? $hash->{$last_letter}->{$cur_letter}++ +: 1; $self->{t_number}++; $last_letter = $cur_letter; } $words_learnt++; } my $total = 0; seek $fh,0,0; while (<$fh>){ my $word = $_; chomp $word; $word =~ s/(\w)/lc($1)/ge;#put it in full lowercase $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanum +eric $total += $self->get_score($word); } $self->{mean_score} = (($self->{mean_score} * $self->{word_count}) +($total * $words_learnt)) / ($self->{word_count} + $words_learnt); $self->{word_count}+=$words_learnt; close $fh; return $words_learnt; } =head2 _transitions getter for t_number field =cut sub _transitions{ my $self = shift; $self->{t_number}; } =head2 get_score This method is used once the automaton has learnt enough words (enough + being left to the judgement of the user, dictionary files are a good source +I reckon) using it it will give you a score that will determine wether that lett +er ordering is unusual (the higher the score the most unusual it will be) +. The score should be around 0 for known words, normal transitions such as c +onsumn to vowels should not add too much to the score. strange ones( such as xxxbxczbbwdx)will do. Keep in mind that a completely unknown transitio +n will give one point, anything else will give a score calculated using +the following formula : 1 - (number of time this transition already occured / total of transit +ions recorded); =head3 use $automaton->get_score($word); =cut sub get_score{ my ($self,$word) = @_; my $score = 0; chomp $word; $word =~ s/(\w)/lc($1)/ge; $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanumeric #symbols if(defined $self->{false_positives}->{$word}){ return $self->{mean_score}; } my @letters = split('',$word); my $last_letter = "empty"; foreach my $l (@letters){ $score += defined $self->{t_table_ref}->{$last_letter}->{$l}? $self->{t_table_ref}->{$last_letter}->{$l} / $self->{t_num +ber}: 1; $last_letter = $l; } return $score; } =head2 flag_as_false_positive this function allows you to flag a specific word as a false positi +ve afterward whenever the autmaton tries to calculate the score of th +e word it will return his mean score to prevent triggering an alert =cut sub flag_as_false_positive{ my($self,$word) = @_; $word =~ s/(\w)/lc($1)/ge; $word =~ s/[^[[::alphanum::]]]//;#gets rid of non alphanumeric #symbols $self->{false_positives}->{$word} = 1; } =head2 get_false_positives returns a reference to a list of flagged false positives from the +automata =cut sub get_false_positives{ my $self = shift; my @false_positives = keys %{$self->{false_positives}}; return \@false_positives; } =head2 sys_learn this subroutines uses the learn function on the system's /usr/share/di +ct/words file if it exists. if it does not it will return 0. If it does it will + return the number of learnt words if the file is empty it will return -1; =head3 use $automaton->sys_learn; =cut sub sys_learn{ my $self = shift; if (!ref $self){ croak("can not syslearn as class!\n"); } my $file_name = $dict_file; if (-e $file_name){ my $learnt = $self->learn($file_name); return $learnt; } else{ croak "bad package var dict_file\n"; } } 1; # End of Markov
And here is the test file : 100% testcover :)
#!perl -T use 5.006; use strict; use warnings; use Test::More; use Test::Exception; use diagnostics; use Security::Monitoring::Detection::Markov; use Security::Monitoring::Utils; BEGIN { plan tests => 25; use_ok( 'Security::Monitoring::Detection::Markov' ) || print "Bail + out!\n"; } our $dict_file; my $class='Security::Monitoring::Detection::Markov'; #setup and test function existence diag( "Testing Markov module $Security::Monitoring::Detection::Markov::VERSION, Perl $], $^X" ); ok(defined(Security::Monitoring::Detection::Markov->new(10)),"automata + generator working"); my $automaton = Security::Monitoring::Detection::Markov->new(10); ok(ref $automaton,"generator correctly returned a ref to the new autom +aton"); my $learnstring = "test_dict"; open my $fh, '>',$learnstring; print $fh "dd\nff"; close $fh; dies_ok(sub {my $new = $class->new(undef)},"new dies with undef alert_ +level"); dies_ok(sub {my $new = $class->new('trololol')},"new dies with non int alert_level"); #tests learning dies_ok(sub {$automaton->learn('trololol');},"learn dies with incorrec +t filename"); dies_ok(sub {$automaton->learn(undef);},"learn dies with undef filenam +e"); dies_ok(sub{$automaton->learn(undef,1)},"learn dies with undef string" +); dies_ok(sub {$automaton->learn('',1)},"learn dies with empty string"); #undefines all env variables to prevent taint warning flush_env; `/usr/bin/touch emptyfile`; dies_ok(sub{$automaton->learn('emptyfile')},"learn dies with empty fil +e"); is($automaton->learn($learnstring),2,"automaton learnt 2 words from fi +le"); unlink $learnstring; is($automaton->_transitions,4,"automaton has 4 transitions"); is($automaton->get_score("mama"),4,"automaton scored 4 on a four new l +etters word"); is($automaton->get_score("dd"),0.5,"automaton scored 0.5 on one of the + two learnt words"); is($automaton->learn("tata\ntutu",1),2,"automaton learnt two words fro +m string"); my $emptymaton = Security::Monitoring::Detection::Markov->new(10); $ENV{PATH} = '/usr/bin'; my $dict_words_number = `wc -l /usr/share/dict/words`; chomp $dict_words_number; $dict_words_number =~ s/[\D]//g; dies_ok(sub{$class->sys_learn;},"sys learn dies when called as class m +ethod"); my $tmp = $Security::Monitoring::Detection::Markov::dict_file; dies_ok(sub{ $Security::Monitoring::Detection::Markov::dict_file = 'tro +lolol'; my $auto = $class->new; $auto->sys_learn; },"sys learn dies with dummy dict file name"); dies_ok(sub{ $Security::Monitoring::Detection::Markov::dict_file = 'emp +tyfile'; my $auto = $class->new; $auto->sys_learn; },"sys learn dies with empty dict file "); unlink 'emptyfile'; $Security::Monitoring::Detection::Markov::dict_file= $tmp; is($emptymaton->sys_learn,$dict_words_number,"sys_learn works correctl +y"); is($emptymaton->{word_count},$dict_words_number,"automaton correctly r +eports number of words learnt"); #tests save and load ok(($automaton->save("automaton.at") =~ /automaton\.at/) && (-e "automaton.at"),"automaton correctly saved to disk"); my $newtomaton = $class->load("automaton.at"); ok(defined($newtomaton),"automaton reloaded"); unlink "automaton.at"; $newtomaton->{empty}->{a}+=1; my $autsub = $class->new(10); my $autsub2 = $class->new(10); my $hash1 = {a=>1}; my $hash2 = {b=>2}; $autsub->{t_table}->{x} = $hash1; $autsub2->{t_table}->{x} = $hash2; #tests false positive handling my $word = 'xxxxxxxmmmmmnnnxyxyxyxynmnmnmn'; ok($newtomaton->flag_as_false_positive($word),"false positive flagging + works"); is(${$newtomaton->get_false_positives}[0],$word,"get_false_positive wo +rks"); is($newtomaton->get_score($word),$newtomaton->{mean_score},"false posi +tive evaluation returns correct value");
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Markov Chain automata class
by VinsWorldcom (Prior) on Aug 24, 2015 at 18:41 UTC | |
by QuillMeantTen (Friar) on Aug 24, 2015 at 18:47 UTC | |
by VinsWorldcom (Prior) on Aug 24, 2015 at 18:51 UTC | |
by QuillMeantTen (Friar) on Aug 25, 2015 at 07:39 UTC |