Hi Monks.
I'm trying to learn Bayesian programming and have found this code. It seems to work fine, but I'd like to be able to see which words it has associated with each category after it has run. Unfortunately, the hash has me totally confused, and I don't know how to print out the words in each category. Can anyone help me with a function to do this?
use strict;
use DB_File;

# Hash with two levels of keys: $words{category}{word} gives count of
# 'word' in 'category'.  Tied to a DB_File to keep it persistent.
my %words;
tie %words, 'DB_File', 'words.db';

# Read a file and return a hash of the word counts in that file
sub parse_file
{
    my ( $file ) = @_;
    my %word_counts;

    # Grab all the words with between 3 and 44 letters
    open FILE, "<$file";
    while ( my $line = <FILE> ) {
        while ( $line =~ s/([[:alpha:]]{3,44})[ \t\n\r]// ) {
            $word_counts{lc($1)}++;
        }
    }
    close FILE;

    return %word_counts;
}

# Add words from a hash to the word counts for a category
sub add_words
{
    my ( $category, %words_in_file ) = @_;

    foreach my $word (keys %words_in_file) {
        $words{"$category-$word"} += $words_in_file{$word};
    }
}

# Get the classification of a file from word counts
sub classify
{
    my ( %words_in_file ) = @_;

    # Calculate the total number of words in each category and
    # the total number of words overall
    my %count;
    my $total = 0;
    foreach my $entry (keys %words) {
        $entry =~ /^(.+)-(.+)$/;
        $count{$1} += $words{$entry};
        $total += $words{$entry};
    }

    # Run through words and calculate the probability for each category
    my %score;
    foreach my $word (keys %words_in_file) {
        foreach my $category (keys %count) {
            if ( defined( $words{"$category-$word"} ) ) {
                $score{$category} += log( $words{"$category-$word"} / $count{$category} );
            } else {
                $score{$category} += log( 0.01 / $count{$category} );
            }
        }
    }

    # Add in the probability that the text is of a specific category
    foreach my $category (keys %count) {
        $score{$category} += log( $count{$category} / $total );
    }

    foreach my $category (sort { $score{$b} <=> $score{$a} } keys %count) {
        print "$category $score{$category}\n";
    }
}

# Supported commands are 'add' to add words to a category and
# 'classify' to get the classification of a file
if ( ( $ARGV[0] eq 'add' ) && ( $#ARGV == 2 ) ) {
    add_words( $ARGV[1], parse_file( $ARGV[2] ) );
} elsif ( ( $ARGV[0] eq 'classify' ) && ( $#ARGV == 1 ) ) {
    classify( parse_file( $ARGV[1] ) );
} else {
    print <<EOUSAGE;
Usage: add <category> <file> - Adds words from <file> to category <category>
       classify <file>       - Outputs classification of <file>
EOUSAGE
}

untie %words;

In reply to hash function by Anonymous Monk
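
One possible way to list the words per category, offered as a sketch rather than a definitive answer: despite what the comment at the top of the script says, add_words stores everything in a flat hash under keys of the form "category-word", so the keys can be split on the last hyphen and regrouped. The names words_by_category and print_categories, and the %sample data, are hypothetical and only there for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Group the flat "category-word" keys into a nested structure:
# $by_category{category}{word} = count.
sub words_by_category {
    my ($words) = @_;    # reference to the (possibly tied) %words hash
    my %by_category;
    while ( my ( $key, $count ) = each %$words ) {
        # parse_file only keeps alphabetic words, so the last hyphen
        # separates the category from the word.
        next unless $key =~ /^(.+)-([^-]+)$/;
        $by_category{$1}{$2} = $count;
    }
    return %by_category;
}

# Print each category followed by its words, most frequent first
# (ties are broken alphabetically).
sub print_categories {
    my (%by_category) = @_;
    foreach my $category ( sort keys %by_category ) {
        print "$category:\n";
        my $counts = $by_category{$category};
        my @ordered = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                      keys %$counts;
        print "  $_ $counts->{$_}\n" for @ordered;
    }
}

# Hypothetical sample data standing in for the tied %words hash.
my %sample = (
    'spam-viagra' => 3,
    'spam-free'   => 5,
    'ham-meeting' => 2,
);
print_categories( words_by_category( \%sample ) );
```

In the script itself, the same call could be made with the tied hash, for example print_categories( words_by_category( \%words ) ) placed before the untie, perhaps behind an extra command-line option of your own choosing.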
