agnes has asked for the wisdom of the Perl Monks concerning the following question:

As the title says, I plan to use the Algorithm::NaiveBayes module in Perl to classify sentences (positive, negative or neutral). I already have all the sentences and have split each sentence into words. I know the first thing I need to do is convert each vector of words into a hash, but I do not know how. Is it true that I should take the category as the hash key and each word as a hash value? I also need to know, after converting these vectors of words into hashes, how to use the Algorithm::NaiveBayes module. Can anyone help me write the code to implement this? Thanks a lot!

Replies are listed 'Best First'.
Re: how to use Algorithm::NaiveBayes module
by tangent (Parson) on Apr 23, 2014 at 03:05 UTC
    The first thing you need to do is "train" the categorizer with known labels. For each category you create a hash using the words as keys and the weights as values - typically the weight would be the number of times the word occurs but you could use other criteria.
    my $positive = {
        word1 => 2,
        word2 => 4,
        word3 => 1,
    };
    my $negative = {
        word4 => 3,
        word5 => 1,
    };
    It is a good idea to normalize each word to lower case and perhaps to stem them, and also to remove words that don't have any effect on the outcome. You then add these hashes to the categorizer:
    my $categorizer = Algorithm::NaiveBayes->new;
    $categorizer->add_instance( attributes => $positive, label => 'positive' );
    $categorizer->add_instance( attributes => $negative, label => 'negative' );
    $categorizer->train;
    Then, for each of your sentences, you create a hash in a similar fashion and call predict() to find the probable classification of each sentence:
    my $sentence1 = { wordA => 2, wordB => 1, };
    my $probability = $categorizer->predict( attributes => $sentence1 );
    if ( $probability->{'positive'} > 0.5 ) {
        # sentence1 probably positive
    }
    elsif ( $probability->{'negative'} > 0.5 ) {
        # sentence1 probably negative
    }
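    The attribute hashes above need not be typed by hand; they can be built from raw text, applying the lower-casing and stopword removal mentioned earlier. A minimal sketch (the `%stop_words` list here is a made-up stand-in, not the real Lingua::EN::StopWords list):

```perl
use strict;
use warnings;

# Hypothetical stopword list; in practice you might import
# %StopWords from Lingua::EN::StopWords instead.
my %stop_words = map { $_ => 1 } qw(the a an and or of to);

# Turn a raw sentence into a hash of word => count, lower-cased,
# with stopwords skipped.
sub sentence_to_attrs {
    my ($sentence) = @_;
    my %attrs;
    for my $word ( split /\W+/, lc $sentence ) {
        next if $word eq '' or $stop_words{$word};
        $attrs{$word}++;
    }
    return \%attrs;
}

my $attrs = sentence_to_attrs("The profit and the revenue increase");
# $attrs is now { profit => 1, revenue => 1, increase => 1 }
```

    The resulting hashref can be passed straight to add_instance() or predict() as the attributes value.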
    There is a section in the book Advanced Perl Programming - 2nd Edition entitled "Categorization and Extraction" that shows extended examples of using this module in conjunction with sentence splitters, stopword lists and stemmers.
      Thanks a lot for your help. I will read the book you recommend! Best wishes!

      Hi! I have read the chapter "Categorization and Extraction" in Advanced Perl Programming and mostly understand how to do my work, but I still have some questions. I really hope you can help me solve them. Here they are:

      1. When training the categories, the code you wrote is

      my $positive = { word1 => 2, word2 => 4, word3 => 1, };
      Does "word1 => 2" mean the number of times word1 appears in the positive sentences? If I have 100 sentences as training data, do I need to produce a hash entry for every word in those sentences by hand? Is there an easier way to train on all the sentences?

      2. In the book the author inverts each document into a hash of words and weights, using the code below:

      sub invert_string {
          my ($string, $weight, $hash) = @_;
          $hash->{$_} += $weight
              for grep { !$StopWords{$_} } @{ words( lc $string ) };
      }

      But I have already done the stemming and stopword removal in advance, so I think the code you wrote:

      my $sentence1 = { wordA => 2, wordB => 1, };

      has the same function. Is there any difference? If I have 100 training sentences, do I need to type out all the distinct words in them using the code above?

      3. If I have hundreds of sentences to classify, how can I invert all of them into hashes?

      4. I cannot quite work out what this code does:

      sub invert_item {
          my $item = shift;
          my %hash;
          invert_string( $item->{title},       2, \%hash );
          invert_string( $item->{description}, 1, \%hash );
          return \%hash;
      }

      Is it true that, because I do not need to weight the title and the content separately, I can skip this step?

      5. Here is the code in the book to train the analyzer:

      #!/usr/bin/perl
      use XML::RSS;
      use Algorithm::NaiveBayes;
      use Lingua::EN::Splitter qw(words);
      use Lingua::EN::StopWords qw(%StopWords);

      my $nb = Algorithm::NaiveBayes->new;
      for my $category (qw(interesting boring)) {
          my $rss = new XML::RSS;
          $rss->parsefile("$category.rdf");
          $nb->add_instance( attributes => invert_item($_), label => $category )
              for @{ $rss->{'items'} };
      }
      $nb->train;    # Work out all the probabilities
      I don't understand the function of:
      my $rss = new XML::RSS;
      $rss->parsefile("$category.rdf");
      $nb->add_instance( attributes => invert_item($_), label => $category )
          for @{ $rss->{'items'} };

      If I skip the invert_item step, what should I write in place of "@{$rss->{'items'}}"?

      6. Should all the code you wrote above go into one Perl file, or does it have to be written in separate files that reference each other by name?

      Your reply will surely be helpful. Thank you so much!

        Say you have three files: positive, negative, and the sentences to test. They are already prepared and are in this format:
        wordA wordB wordC wordD wordA wordE wordF
        To train you would feed the first two files in:
        my $pos_file = '/path/to/positive.txt';
        my $neg_file = '/path/to/negative.txt';
        my $categorizer = Algorithm::NaiveBayes->new;
        my $fh;

        open( $fh, "<", $pos_file ) or die "Could not open $pos_file: $!";
        while ( my $sentence = <$fh> ) {
            chomp $sentence;
            my @words = split ' ', $sentence;
            my %positive;
            $positive{$_}++ for @words;
            $categorizer->add_instance( attributes => \%positive, label => 'positive' );
        }
        close($fh);

        open( $fh, "<", $neg_file ) or die "Could not open $neg_file: $!";
        while ( my $sentence = <$fh> ) {
            chomp $sentence;
            my @words = split ' ', $sentence;
            my %negative;
            $negative{$_}++ for @words;
            $categorizer->add_instance( attributes => \%negative, label => 'negative' );
        }
        close($fh);

        $categorizer->train;
        You can then feed the third file in:
        my $sentence_file = '/path/to/sentence.txt';
        open( $fh, "<", $sentence_file ) or die "Could not open $sentence_file: $!";
        while ( my $sentence = <$fh> ) {
            chomp $sentence;
            my @words = split ' ', $sentence;
            my %test;
            $test{$_}++ for @words;
            my $probability = $categorizer->predict( attributes => \%test );
            # ...
            # do what you need with $probability
        }
        close($fh);
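        Note that predict() returns a hash reference mapping each label to a score, so "do what you need with $probability" usually means picking the label with the highest score. A sketch, using made-up scores in place of a real predict() result:

```perl
use strict;
use warnings;

# Stand-in for what predict() returns: a hashref of label => score.
# These numbers are invented for illustration.
my $probability = { positive => 0.72, negative => 0.18, neutral => 0.10 };

# Sort the labels by descending score and take the first one.
my ($best_label) = sort { $probability->{$b} <=> $probability->{$a} }
                   keys %$probability;

print "Best label: $best_label\n";   # prints "Best label: positive"
```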
Re: how to use Algorithm::NaiveBayes module
by atcroft (Abbot) on Apr 23, 2014 at 03:05 UTC

    The Synopsis section of the Algorithm::NaiveBayes docs seems fairly straightforward. While I haven't tested the code below, I believe it should work as intended.

    Hope that helps.

      Hi! Thank you for your code again. Can you please tell me what Data::Dumper does? I want to classify each training sentence into a certain category, such as revenue, cost, profit and so on, because the same word can carry a different tone in a different context. For example, the word "increase" would be positive in a sentence about revenue, but negative in a sentence about cost. How can I modify the code to implement this? Your reply will be helpful to me. Thank you!
      I have run the code, and the result looks like this:

      Prediction: HASH(0x7f8f540c4c00) - /Users/Agnes/Documents/process_sentence.txt

      Prediction: HASH(0x7f8f540c2380) - /Users/Agnes/Documents/test.txt

      my code is as follow:
      #!/usr/bin/perl
      use Algorithm::NaiveBayes;
      use Data::Dumper;

      $| = 1;
      $Data::Dumper::Deepcopy = 1;
      $Data::Dumper::Sortkeys = 1;

      my %training_files = (
          positive => q{/Users/Agnes/Documents/positive.txt},
          negative => q{/Users/Agnes/Documents/negative.txt},
          neutral  => q{/Users/Agnes/Documents/neutral.txt},
      );
      my @test_files = (
          q{/Users/Agnes/Documents/process_sentence.txt},
          q{/Users/Agnes/Documents/test.txt},
      );

      my $nb = Algorithm::NaiveBayes->new( purge => 0 );
      foreach my $k ( keys %training_files ) {
          local $/;
          open my $inf, q{<}, $training_files{$k} or die $!;
          my $line = <$inf>;
          close $inf;
          $nb->add_instance(
              attributes => str_to_array($line),
              label      => [$word],
          );
      }
      $nb->train;

      foreach my $tf (@test_files) {
          local $/;
          open my $inf, q{<}, $tf or die $!;
          my $line = <$inf>;
          close $inf;
          my $result = $nb->predict( attributes => str_to_array($line) );
          print qq{Prediction: $result - $tf\n};
      }

      sub str_to_array {
          my ($str) = @_;
          my %attr;
          foreach my $word ( split /\s|[\(\)!?.,:;]/, $str ) {
              $attr{$word}++;
          }
          return \%attr;
      }
      Is there any problem in it?
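      A note on the "HASH(0x...)" output above: predict() returns a hash reference of label => score, so interpolating $result into a string prints the raw reference. Dereference it instead. A sketch, with made-up scores standing in for a real predict() result:

```perl
use strict;
use warnings;

# Stand-in for a predict() return value: a hashref of label => score.
# The numbers here are invented for illustration.
my $result = { positive => 0.55, negative => 0.30, neutral => 0.15 };

# Print each label with its score, highest first, instead of the raw reference.
for my $label ( sort { $result->{$b} <=> $result->{$a} } keys %$result ) {
    printf "%s: %.3f\n", $label, $result->{$label};
}
```

      Data::Dumper serves a similar debugging purpose: `print Dumper($result);` would show the whole label-to-score structure at once.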
      Thank you so much! I will try this code and report my result to you!! Have a good night!!