agnes has asked for the wisdom of the Perl Monks concerning the following question:

As the title says, I plan to use the Algorithm::NaiveBayes module in Perl to classify sentences (positive, negative or neutral). I already have all the sentences and have split each sentence into words. I know the first thing I need to do is convert each vector of words into a hash, but I do not know how. Is it true that I should take the category as the hash key and each word as a hash value? I also need to know, after converting these vectors of words into hashes, how to use the Algorithm::NaiveBayes module. Can anyone help me write the code to implement this? Thanks a lot!

Replies are listed 'Best First'.
Re: how to use Algorithm::NaiveBayes module
by tangent (Parson) on Apr 23, 2014 at 03:05 UTC
    The first thing you need to do is "train" the categorizer with known labels. For each category you create a hash using the words as keys and the weights as values - typically the weight would be the number of times the word occurs but you could use other criteria.
    my $positive = {
        word1 => 2,
        word2 => 4,
        word3 => 1,
    };
    my $negative = {
        word4 => 3,
        word5 => 1,
    };
    It is a good idea to normalize each word to lower case and perhaps to stem them, and also to remove words that don't have any effect on the outcome. You then add these hashes to the categorizer:
    my $categorizer = Algorithm::NaiveBayes->new;
    $categorizer->add_instance( attributes => $positive, label => 'positive' );
    $categorizer->add_instance( attributes => $negative, label => 'negative' );
    $categorizer->train;
    Then, for each of your sentences, you create a hash in a similar fashion and call predict() to find the probable classification of each sentence:
    my $sentence1 = { wordA => 2, wordB => 1, };
    my $probability = $categorizer->predict( attributes => $sentence1 );
    if ( $probability->{'positive'} > 0.5 ) {
        # sentence1 probably positive
    }
    elsif ( $probability->{'negative'} > 0.5 ) {
        # sentence1 probably negative
    }
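    The attribute hashes above need not be typed by hand; they can be built from raw text, applying the lower-casing and stopword removal mentioned earlier. A minimal sketch (the `%stop_words` list here is a made-up stand-in, not the real Lingua::EN::StopWords list):

```perl
use strict;
use warnings;

# Hypothetical stopword list; in practice you might import
# %StopWords from Lingua::EN::StopWords instead.
my %stop_words = map { $_ => 1 } qw(the a an and or of to);

# Turn a raw sentence into a hash of word => count, lower-cased,
# with stopwords skipped.
sub sentence_to_attrs {
    my ($sentence) = @_;
    my %attrs;
    for my $word ( split /\W+/, lc $sentence ) {
        next if $word eq '' or $stop_words{$word};
        $attrs{$word}++;
    }
    return \%attrs;
}

my $attrs = sentence_to_attrs("The profit and the revenue increase");
# $attrs is now { profit => 1, revenue => 1, increase => 1 }
```

    The resulting hashref can be passed straight to add_instance() or predict() as the attributes value.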
    There is a section in the book Advanced Perl Programming - 2nd Edition entitled "Categorization and Extraction" that shows extended examples of using this module in conjunction with sentence splitters, stopword lists and stemmers.
      Thanks a lot for your help. I will read the book you recommend! Best wishes!

      Hi! I have read the chapter "Categorization and Extraction" in Advanced Perl Programming and mostly understand how to do my work, but I still have some questions. I really hope you can help me solve them. Here they are:

      1. When training the categories, the code you wrote is

      my $positive = { word1 => 2, word2 => 4, word3 => 1, };
      Does "word1 => 2" mean the number of times word1 appears in the positive sentences? If I have 100 sentences as training data, do I need to produce a hash entry for every word in those sentences by hand? Is there an easier way to train on all the sentences?

      2. In the book the author inverts each document into a hash of words and weights, using the code below:

      sub invert_string {
          my ($string, $weight, $hash) = @_;
          $hash->{$_} += $weight
              for grep { !$StopWords{$_} } @{ words( lc $string ) };
      }

      But I have already done the stemming and stopword removal in advance, so I think the code you wrote:

      my $sentence1 = { wordA => 2, wordB => 1, };

      has the same function. Is there any difference? If I have 100 training sentences, do I need to type out all the distinct words in them using the code above?

      3. If I have hundreds of sentences to classify, how can I invert all of them into hashes?

      4. I cannot quite work out what this code does:

      sub invert_item {
          my $item = shift;
          my %hash;
          invert_string( $item->{title},       2, \%hash );
          invert_string( $item->{description}, 1, \%hash );
          return \%hash;
      }

      Is it true that, because I do not need to weight the title and the content separately, I can skip this step?

      5. Here is the code in the book to train the analyzer:

      #!/usr/bin/perl
      use XML::RSS;
      use Algorithm::NaiveBayes;
      use Lingua::EN::Splitter qw(words);
      use Lingua::EN::StopWords qw(%StopWords);

      my $nb = Algorithm::NaiveBayes->new;
      for my $category (qw(interesting boring)) {
          my $rss = new XML::RSS;
          $rss->parsefile("$category.rdf");
          $nb->add_instance( attributes => invert_item($_), label => $category )
              for @{ $rss->{'items'} };
      }
      $nb->train;    # Work out all the probabilities
      I don't understand the function of:
      my $rss = new XML::RSS;
      $rss->parsefile("$category.rdf");
      $nb->add_instance( attributes => invert_item($_), label => $category )
          for @{ $rss->{'items'} };

      If I skip the invert_item step, what should I write in place of "@{$rss->{'items'}}"?

      6. Should all the code you wrote above go into one Perl file, or does it have to be written in separate files that reference each other by name?

      Your reply will surely be helpful. Thank you so much!

        Say you have three files: positive, negative, and the sentences to test. They are already prepared and are in this format:
        wordA wordB wordC wordD wordA wordE wordF
        To train you would feed the first two files in:
        my $pos_file = '/path/to/positive.txt';
        my $neg_file = '/path/to/negative.txt';
        my $categorizer = Algorithm::NaiveBayes->new;
        my $fh;

        open( $fh, "<", $pos_file ) or die "Could not open $pos_file: $!";
        while ( my $sentence = <$fh> ) {
            chomp $sentence;
            my @words = split ' ', $sentence;
            my %positive;
            $positive{$_}++ for @words;
            $categorizer->add_instance( attributes => \%positive, label => 'positive' );
        }
        close($fh);

        open( $fh, "<", $neg_file ) or die "Could not open $neg_file: $!";
        while ( my $sentence = <$fh> ) {
            chomp $sentence;
            my @words = split ' ', $sentence;
            my %negative;
            $negative{$_}++ for @words;
            $categorizer->add_instance( attributes => \%negative, label => 'negative' );
        }
        close($fh);

        $categorizer->train;
        You can then feed the third file in:
        my $sentence_file = '/path/to/sentence.txt';
        open( $fh, "<", $sentence_file ) or die "Could not open $sentence_file: $!";
        while ( my $sentence = <$fh> ) {
            chomp $sentence;
            my @words = split ' ', $sentence;
            my %test;
            $test{$_}++ for @words;
            my $probability = $categorizer->predict( attributes => \%test );
            # ...
            # do what you need with $probability
        }
        close($fh);
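        Note that predict() returns a hash reference mapping each label to a score, so "do what you need with $probability" usually means picking the label with the highest score. A sketch, using made-up scores in place of a real predict() result:

```perl
use strict;
use warnings;

# Stand-in for what predict() returns: a hashref of label => score.
# These numbers are invented for illustration.
my $probability = { positive => 0.72, negative => 0.18, neutral => 0.10 };

# Sort the labels by descending score and take the first one.
my ($best_label) = sort { $probability->{$b} <=> $probability->{$a} }
                   keys %$probability;

print "Best label: $best_label\n";   # prints "Best label: positive"
```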
Re: how to use Algorithm::NaiveBayes module
by atcroft (Abbot) on Apr 23, 2014 at 03:05 UTC

    The Synopsis section of the Algorithm::NaiveBayes docs seems fairly straightforward. While I haven't tested the code below, I believe it should work as intended.

    Hope that helps.

      Hi! Thank you for your code again. Can you please tell me what Data::Dumper does? I want to classify each training sentence into a certain category, such as revenue, cost, profit and so on, because the same word can carry a different tone in a different context. For example, the word "increase" would be positive in a sentence about revenue, but negative in a sentence about cost. How can I modify the code to implement this? Your reply will be helpful to me. Thank you!
      I have run the code, and the result looks like this:

      Prediction: HASH(0x7f8f540c4c00) - /Users/Agnes/Documents/process_sentence.txt

      Prediction: HASH(0x7f8f540c2380) - /Users/Agnes/Documents/test.txt

      my code is as follow:
      #!/usr/bin/perl
      use Algorithm::NaiveBayes;
      use Data::Dumper;

      $| = 1;
      $Data::Dumper::Deepcopy = 1;
      $Data::Dumper::Sortkeys = 1;

      my %training_files = (
          positive => q{/Users/Agnes/Documents/positive.txt},
          negative => q{/Users/Agnes/Documents/negative.txt},
          neutral  => q{/Users/Agnes/Documents/neutral.txt},
      );
      my @test_files = (
          q{/Users/Agnes/Documents/process_sentence.txt},
          q{/Users/Agnes/Documents/test.txt},
      );

      my $nb = Algorithm::NaiveBayes->new( purge => 0 );
      foreach my $k ( keys %training_files ) {
          local $/;
          open my $inf, q{<}, $training_files{$k} or die $!;
          my $line = <$inf>;
          close $inf;
          $nb->add_instance(
              attributes => str_to_array($line),
              label      => [$word],
          );
      }
      $nb->train;

      foreach my $tf (@test_files) {
          local $/;
          open my $inf, q{<}, $tf or die $!;
          my $line = <$inf>;
          close $inf;
          my $result = $nb->predict( attributes => str_to_array($line) );
          print qq{Prediction: $result - $tf\n};
      }

      sub str_to_array {
          my ($str) = @_;
          my %attr;
          foreach my $word ( split /\s|[\(\)!?.,:;]/, $str ) {
              $attr{$word}++;
          }
          return \%attr;
      }
      Is there any problem in it?
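      A note on the "HASH(0x...)" output above: predict() returns a hash reference of label => score, so interpolating $result into a string prints the raw reference. Dereference it instead. A sketch, with made-up scores standing in for a real predict() result:

```perl
use strict;
use warnings;

# Stand-in for a predict() return value: a hashref of label => score.
# The numbers here are invented for illustration.
my $result = { positive => 0.55, negative => 0.30, neutral => 0.15 };

# Print each label with its score, highest first, instead of the raw reference.
for my $label ( sort { $result->{$b} <=> $result->{$a} } keys %$result ) {
    printf "%s: %.3f\n", $label, $result->{$label};
}
```

      Data::Dumper serves a similar debugging purpose: `print Dumper($result);` would show the whole label-to-score structure at once.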
      Thank you so much! I will try this code and report my result to you!! Have a good night!!