in reply to Re: how to use Algorithm::NaiveBayes module
in thread how to use Algorithm::NaiveBayes module

Hi, I have already read the chapter "Categorization and Extraction" in Advanced Perl Programming and I mostly understand how to do my work. But I still have some questions, and I would be grateful if you could help me with them. Here they are:

1. When I train the categories, the code you wrote is:

my $positive = { word1 => 2, word2 => 4, word3 => 1, };
Does "word1 => 2" mean the number of times word1 appears in the positive sentences? If I have 100 sentences to use for training, do I need to build a hash of all the words in these sentences by hand? Is there an easier way to train on all the sentences?

2. In the book, the author inverts the document into a hash of words and weights, using the code below:

sub invert_string {
    my ($string, $weight, $hash) = @_;
    $hash->{$_} += $weight
        for grep { !$StopWords{$_} } @{words(lc($string))};
}

But I have already done the stemming and stop-word removal in advance, so I think the code you wrote:

my $sentence1 = { wordA => 2, wordB => 1, };

has the same function. Is there any difference? If I have 100 training sentences, do I need to type out all the different words in them using the code above?

3. If I have hundreds of sentences to make predictions on, how can I invert all of them into hash variables?

4. I do not quite understand what this code does:

sub invert_item {
    my $item = shift;
    my %hash;
    invert_string($item->{title}, 2, \%hash);
    invert_string($item->{description}, 1, \%hash);
    return \%hash;
}

Is it true that, because I do not need separate weights for the title and the content, I can skip this step?

5. Here is the code in the book to train the analyzer:

#!/usr/bin/perl
use XML::RSS;
use Algorithm::NaiveBayes;
use Lingua::EN::Splitter qw(words);
use Lingua::EN::StopWords qw(%StopWords);

my $nb = Algorithm::NaiveBayes->new( );

for my $category (qw(interesting boring)) {
    my $rss = new XML::RSS;
    $rss->parsefile("$category.rdf");
    $nb->add_instance(attributes => invert_item($_),
                      label      => $category) for @{$rss->{'items'}};
}

$nb->train; # Work out all the probabilities
I don't understand what this part does:

my $rss = new XML::RSS;
$rss->parsefile("$category.rdf");
$nb->add_instance(attributes => invert_item($_),
                  label      => $category) for @{$rss->{'items'}};

If I skip the invert_item step, what should I write in place of "@{$rss->{'items'}}"?

6. Should all the code you wrote above go in one Perl file, or does it have to be written in separate files that refer to each other by name?

Your reply will surely be helpful. Thank you so much!!

Re^3: how to use Algorithm::NaiveBayes module
by tangent (Parson) on Apr 24, 2014 at 11:31 UTC
    Say you have three files: positive, negative, and the sentences to test. They are already prepared and are in this format:
    wordA wordB wordC wordD wordA wordE wordF
    To train you would feed the first two files in:
    use strict;
    use warnings;
    use Algorithm::NaiveBayes;

    my $pos_file = '/path/to/positive.txt';
    my $neg_file = '/path/to/negative.txt';

    my $categorizer = Algorithm::NaiveBayes->new;

    my $fh;
    open($fh,"<",$pos_file) or die "Could not open $pos_file: $!";
    while (my $sentence = <$fh>) {
        chomp $sentence;
        my @words = split(' ',$sentence);
        # Count how often each word occurs in this sentence
        my %positive;
        $positive{$_}++ for @words;
        $categorizer->add_instance(
            attributes => \%positive,
            label      => 'positive');
    }
    close($fh);

    open($fh,"<",$neg_file) or die "Could not open $neg_file: $!";
    while (my $sentence = <$fh>) {
        chomp $sentence;
        my @words = split(' ',$sentence);
        my %negative;
        $negative{$_}++ for @words;
        $categorizer->add_instance(
            attributes => \%negative,
            label      => 'negative');
    }
    close($fh);

    $categorizer->train;
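The two training loops above differ only in the file name and the label, so the counting step could be pulled into one helper. This is a minimal sketch, assuming one pre-processed sentence per line; `count_words` is a hypothetical name, not part of Algorithm::NaiveBayes. It also shows what the numbers in the hash mean: each value is how many times that word occurs in the sentence.

```perl
use strict;
use warnings;

# Turn one pre-processed sentence into a { word => count } hash reference.
# count_words is a hypothetical helper name, not part of Algorithm::NaiveBayes.
sub count_words {
    my ($sentence) = @_;
    my %count;
    $count{$_}++ for split ' ', $sentence;
    return \%count;
}

my $attributes = count_words('wordA wordB wordC wordD wordA wordE wordF');

# Print each word with its count, one per line.
print "$_ => $attributes->{$_}\n" for sort keys %$attributes;
```

Inside each while loop the call would then become `$categorizer->add_instance(attributes => count_words($sentence), label => 'positive')`, so no hash ever has to be typed by hand.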
    You can then feed the third file in:
    my $sentence_file = '/path/to/sentence.txt';
    open($fh,"<",$sentence_file) or die "Could not open $sentence_file: $!";
    while (my $sentence = <$fh>) {
        chomp $sentence;
        my @words = split(' ',$sentence);
        my %test;
        $test{$_}++ for @words;
        my $probability = $categorizer->predict(attributes => \%test);
        # ...
        # do what you need with $probability
    }
    close($fh);
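As for what to do with `$probability`: `predict` returns a hash reference keyed by label, so one option is to take the label with the highest score. A minimal sketch follows; the hash below is a stand-in with invented scores, used only so the snippet runs without the module.

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by predict(); these scores are
# invented for illustration only.
my $scores = { positive => 0.8, negative => 0.2 };

# Sort the labels by descending score and keep the first one.
my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;

printf "%s (%.2f)\n", $best, $scores->{$best};   # prints "positive (0.80)"
```

In the loop above, `$scores` would simply be `$probability`.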
      I got it, thank you for the detailed response. But there is one problem: I have divided these sentences into different categories, such as revenue, cost, profit and so on, because the same word can have a different tone in a different context. Take the word "increase": if it appears in a sentence about revenue, it is positive; if it appears in a sentence about cost, it is negative. How can I modify the code you provided to handle this? Thanks again!
      Hi, here is my code, which ignores the categories I mentioned above (revenue, cost, ...):
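One possible way to handle a per-topic tone, sketched under the assumption that the topic of every sentence is already known at both training and prediction time: tag each word with its topic, so that "increas" in a revenue sentence and "increas" in a cost sentence become two different attributes. `topic_attributes` is a hypothetical helper, and the sample sentences are made up.

```perl
use strict;
use warnings;

# Build attributes in which every word is tagged with the sentence's topic.
# topic_attributes is a hypothetical helper; the topic is assumed to be known
# for each sentence, both when training and when predicting.
sub topic_attributes {
    my ($topic, $sentence) = @_;
    my %count;
    $count{"${topic}:$_"}++ for split ' ', $sentence;
    return \%count;
}

my $revenue = topic_attributes('revenue', 'revenu increas quarter');
my $cost    = topic_attributes('cost',    'cost increas quarter');

print join(' ', sort keys %$revenue), "\n";
print join(' ', sort keys %$cost), "\n";
```

The resulting hashes are passed to `add_instance` and `predict` in place of the plain word counts. An alternative with the same effect is to keep one Algorithm::NaiveBayes object per topic and train each only on that topic's sentences.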
      #!/usr/bin/perl
      use warnings;
      use Algorithm::NaiveBayes;

      my $pos_file = '/Users/Agnes/Documents/positive.TXT';
      my $neg_file = '/Users/Agnes/Documents/negative.txt';
      my $neu_file = '/Users/Agnes/Documents/neutral.txt';

      my $categorizer = Algorithm::NaiveBayes->new;

      my $fh;
      open($fh,"<",$pos_file) or die "Could not open $pos_file: $!";
      while (my $sentence = <$fh>) {
          chomp $sentence;
          my @words = split(' ',$sentence);
          my %positive;
          $positive{$_}++ for @words;
          $categorizer->add_instance(
              attributes => \%positive,
              label      => 'positive');
      }
      close($fh);

      open($fh,"<",$neg_file) or die "Could not open $neg_file: $!";
      while (my $sentence = <$fh>) {
          chomp $sentence;
          my @words = split(' ',$sentence);
          my %negative;
          $negative{$_}++ for @words;
          $categorizer->add_instance(
              attributes => \%negative,
              label      => 'negative');
      }
      close($fh);

      open($fh,"<",$neu_file) or die "Could not open $neg_file: $!";
      while (my $sentence = <$fh>) {
          chomp $sentence;
          my @words = split(' ',$sentence);
          my %neutral;
          $neutral{$_}++ for @words;
          $categorizer->add_instance(
              attributes => \%neutral,
              label      => 'neutral');
      }
      close($fh);

      $categorizer->train;

      my $sentence_file = '/Users/Agnes/Documents/process_sentence.txt';
      open($fh,"<",$sentence_file) or die "Could not open $sentence_file: $!";
      while (my $sentence = <$fh>) {
          chomp $sentence;
          my @words = split(' ',$sentence);
          my %test;
          $test{$_}++ for @words;
          my $probability = $categorizer->predict(attributes => \%test);
          if ( $probs->{positive} > 0.33 ) {
              print "%positive\n";
          }
          if ( $probs->{negative} > 0.33 ) {
              print "%negative\n";
          }
          if ( $probs->{neutral} > 0.33 ) {
              print "%neutral\n";
          }
      }
      close($fh);

      My positive.txt looks like this:

      we believ exist cash cash equiv short-term investments, togeth fund generat operations, suffici meet oper requirements, regular quarter dividends, debt.

      expen will reduc cut travel expenditures, reduc spend vendor cont staff, reduc market spending, scale back capit.

      revenu relat window vista no subject similar deferr no signif undeliv elements.

      But when I run this program, these warnings appear:

      Use of uninitialized value in numeric gt (>) at calculation.pl line 60, <$fh> line 1.

      Use of uninitialized value in numeric gt (>) at calculation.pl line 63, <$fh> line 1.

      Use of uninitialized value in numeric gt (>) at calculation.pl line 66, <$fh> line 1.

        The result of predict is stored in $probability, so $probs->{positive} should be $probability->{positive}, and likewise for negative and neutral.

        Also, if you have empty lines in your files then add next unless $sentence; after each chomp, and you need to remove the commas from each sentence too.
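The clean-up described above could look like the following. It is a minimal sketch: the `@lines` array stands in for the lines read from one of the files, and its contents are shortened from the sample shown earlier.

```perl
use strict;
use warnings;

# Sample lines standing in for the contents of a sentence file.
my @lines = (
    "expen will reduc cut travel expenditures, reduc spend vendor\n",
    "\n",
    "revenu relat window vista\n",
);

my @cleaned;
for my $sentence (@lines) {
    chomp $sentence;
    next unless $sentence;      # skip empty lines
    $sentence =~ tr/,//d;       # delete every comma
    push @cleaned, $sentence;
}

print "$_\n" for @cleaned;
```

Without the comma removal, `split(' ', $sentence)` would treat "expenditures," and "expenditures" as two different words. The same applies to the trailing full stops in the sample file: "debt." stays a separate token from "debt" unless the full stop is stripped as well, for example with `tr/,.//d`.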