Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks.
I'm trying to learn Bayesian programming, and have found this code. It seems to work fine, but I'd like to be able to see what words it has associated with each category after it has run. Unfortunately the hash has me totally confused and I don't know how to print out the words in each category. Can anyone help me with a function to do this?
use strict;
use DB_File;

# Hash with two levels of keys: $words{category}{word} gives count of
# 'word' in 'category'. Tied to a DB_File to keep it persistent.
my %words;
tie %words, 'DB_File', 'words.db';

# Read a file and return a hash of the word counts in that file
sub parse_file {
    my ( $file ) = @_;
    my %word_counts;

    # Grab all the words with between 3 and 44 letters
    open FILE, "<$file";
    while ( my $line = <FILE> ) {
        while ( $line =~ s/([[:alpha:]]{3,44})[ \t\n\r]// ) {
            $word_counts{lc($1)}++;
        }
    }
    close FILE;
    return %word_counts;
}

# Add words from a hash to the word counts for a category
sub add_words {
    my ( $category, %words_in_file ) = @_;
    foreach my $word (keys %words_in_file) {
        $words{"$category-$word"} += $words_in_file{$word};
    }
}

# Get the classification of a file from word counts
sub classify {
    my ( %words_in_file ) = @_;

    # Calculate the total number of words in each category and
    # the total number of words overall
    my %count;
    my $total = 0;
    foreach my $entry (keys %words) {
        $entry =~ /^(.+)-(.+)$/;
        $count{$1} += $words{$entry};
        $total += $words{$entry};
    }

    # Run through words and calculate the probability for each category
    my %score;
    foreach my $word (keys %words_in_file) {
        foreach my $category (keys %count) {
            if ( defined( $words{"$category-$word"} ) ) {
                $score{$category} += log( $words{"$category-$word"} / $count{$category} );
            } else {
                $score{$category} += log( 0.01 / $count{$category} );
            }
        }
    }

    # Add in the probability that the text is of a specific category
    foreach my $category (keys %count) {
        $score{$category} += log( $count{$category} / $total );
    }

    foreach my $category (sort { $score{$b} <=> $score{$a} } keys %count) {
        print "$category $score{$category}\n";
    }
}

# Supported commands are 'add' to add words to a category and
# 'classify' to get the classification of a file
if ( ( $ARGV[0] eq 'add' ) && ( $#ARGV == 2 ) ) {
    add_words( $ARGV[1], parse_file( $ARGV[2] ) );
} elsif ( ( $ARGV[0] eq 'classify' ) && ( $#ARGV == 1 ) ) {
    classify( parse_file( $ARGV[1] ) );
} else {
    print <<EOUSAGE;
Usage: add <category> <file> - Adds words from <file> to category <category>
       classify <file> - Outputs classification of <file>
EOUSAGE
}

untie %words;

Replies are listed 'Best First'.
Re: hash function
by grep (Monsignor) on Oct 07, 2006 at 16:17 UTC
    The quick, dirty, and universal way - use Data::Dumper. It should be already installed.

    # top of file
    use Data::Dumper;

    # rest of code
    print Dumper \%hash_i_want_to_look_at;
    # this prints a nice representation of the data in the hash to STDOUT

    I would give you a more specific solution, if you pare your code down to where you want to view the hash, instead of posting this large section of code.

    Check out How (Not) To Ask A Question, and more specifically this section. Following those guidelines helps us give you a better answer.



    grep
    One dead unjugged rabbit fish later
      Last time I asked a question, people complained that I didn't show all the code, so I thought it best to do it this time. Seems like it's a case of damned if I do, damned if I don't, eh?
      Basically I was after a function that I could call from the command line to print out all the words in a particular category:
      # show the words of a classification
      sub show_classification {
          my ( $category ) = @_;
          # for each word in the matching category, print the word
      }
        You'll get a knack for what works when asking questions. As an exercise, read others' questions and try to answer them (you don't have to post the answer if you don't want to). You'll quickly see how to think like someone trying to answer a question, and what they need in order to answer it quickly.

        What I would've done when asking this question: go to the data. You want to look inside %words, so I would have posted the sub classify, which uses %words, and the tie that sets up %words. The thing is, you're asking at a Perl site, so you can assume some of us understand Perl. But we don't know your data.

        To your question:
        Take the keys of the %words and split them on '-'.

        ## UNTESTED
        use Data::Dumper;

        my %cats;
        foreach my $cat_word (keys %words) {
            my ($cat,$word) = split(/-/,$cat_word);
            push(@{$cats{$cat}},$word);
        }
        print Dumper \%cats;
        There you should have a hash %cats with the words that are in each cat.

        As for adding that into your program and printing it nicer than Data::Dumper, I'll leave that up to you.
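        For what it's worth, here is a minimal sketch of one way to print it more readably than Data::Dumper does, assuming the "category-word" key layout from add_words(). The names group_by_category and format_categories are made up for illustration, and %sample is stand-in data rather than the tied %words.

```perl
use strict;
use warnings;

# Hypothetical helper: turn { "category-word" => count } into
# { category => { word => count } }.
sub group_by_category {
    my ($words) = @_;
    my %cats;
    for my $key ( keys %$words ) {
        # Split at the last '-', as the regex in classify() does,
        # so a category name may itself contain hyphens.
        my ( $cat, $word ) = $key =~ /^(.+)-(.+)$/ or next;
        $cats{$cat}{$word} = $words->{$key};
    }
    return \%cats;
}

# Hypothetical helper: one line per category, words sorted alphabetically.
sub format_categories {
    my ($cats) = @_;
    my $out = '';
    for my $cat ( sort keys %$cats ) {
        my @words = sort keys %{ $cats->{$cat} };
        $out .= "$cat: @words\n";
    }
    return $out;
}

# Stand-in data; in the real script you would pass \%words instead.
my %sample = ( 'spam-viagra' => 3, 'spam-offer' => 1, 'ham-meeting' => 2 );
print format_categories( group_by_category( \%sample ) );
```

        Both subs take a hash reference, so the same code should work on the tied hash or on an ordinary in-memory copy.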



        grep
        One dead unjugged rabbit fish later
Re: hash function
by BrowserUk (Patriarch) on Oct 07, 2006 at 17:13 UTC

    As the category/word combinations are being stored in the hash combined into each key:

    # from your add_words() sub
    ...
    $words{"$category-$word"} += $words_in_file{$word};

    You cannot use the main feature of hashes (O(1) lookup) to find everything in one category. Instead you have to iterate over the keys of the hash, break each one into its components, and then print the word if the category matches.

    sub show_classification {
        my ( $wantedCategory ) = @_;
        # for each word in the matching category, print the word
        for my $key ( keys %words ) {
            my( $category, $word ) = split '-', $key;
            print "$word\n" if $category eq $wantedCategory;
        }
    }
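    Since the request was for something callable from the command line, here is a rough sketch of how a 'show' command could sit next to 'add' and 'classify'. It is an assumption about how you might wire it in, not part of the original script: words_in_category is a hypothetical name, the 'show' command does not exist in the posted code, and %sample stands in for the tied %words.

```perl
use strict;
use warnings;

# Hypothetical variant of show_classification(): returns the matching
# words (sorted) instead of printing them, and takes the hash by reference.
sub words_in_category {
    my ( $words, $wanted ) = @_;
    my @found;
    for my $key ( keys %$words ) {
        # Same pattern as classify(): category is everything before the last '-'.
        my ( $category, $word ) = $key =~ /^(.+)-(.+)$/ or next;
        push @found, $word if $category eq $wanted;
    }
    my @sorted = sort @found;
    return @sorted;
}

# Stand-in data; in the real script this would be the tied %words.
my %sample = ( 'spam-offer' => 1, 'spam-viagra' => 3, 'ham-meeting' => 2 );

# A 'show <category>' branch, mirroring the existing 'add' and 'classify' checks.
if ( @ARGV == 2 && $ARGV[0] eq 'show' ) {
    print "$_\n" for words_in_category( \%sample, $ARGV[1] );
}
```

    Saved as, say, show.pl, it would be run as "perl show.pl show spam". Every call still walks all the keys, which is the cost of the composite-key layout described above.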

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks BrowserUK! I don't fully understand it, but I'll sit down with my perl book now and figure out what you've done. cheers!
Re: hash function
by graff (Chancellor) on Oct 07, 2006 at 17:17 UTC
    If I understand your question, you just want to look at the contents of the DB file "words.db". The contents of the file are a set of key/value pairs where the keys are strings composed of a category name and a word taken from one or more input text files.

    I don't think Data::Dumper would help in terms of inspecting the contents of the DB file. (thanks for the correction, grep) If you would like a plain-text (flat-table) dump of the DB file contents, something like this would do:

    use strict;
    use DB_File;

    # Hash with composite key strings: $words{category-word} gives count of
    # 'word' in 'category'. Tied to a DB_File to keep it persistent.
    my %words;
    tie %words, 'DB_File', 'words.db';

    open( TXT, '>', 'words.txt' ) or die "Cannot write words.txt: $!";
    while ( my ($key, $value) = each %words ) {
        print TXT "$key\t$value\n";
    }
    close TXT;
    untie %words;
    If you want something more "organized" in terms of output (e.g. sorting entries by category or by word), you could run whatever you want on the "words.txt" file, or you can change the while loop above to do things other than just print out all the key/value pairs.

    Loading the DB file contents into an in-memory data structure shouldn't be a problem for this kind of app, in case you want to do things like sorting, or working out how many categories are associated with each word, etc.
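
    As an illustration of the "how many categories per word" idea, here is a small sketch. It assumes the "category-word" key format described above; categories_per_word is a hypothetical name, and %sample is made-up data standing in for the contents of words.db.

```perl
use strict;
use warnings;

# Hypothetical helper: build { word => { category => count } } from
# a hash of "category-word" => count entries.
sub categories_per_word {
    my ($words) = @_;
    my %seen;
    for my $key ( keys %$words ) {
        # Category is everything before the last '-', the word is the rest.
        my ( $cat, $word ) = $key =~ /^(.+)-(.+)$/ or next;
        $seen{$word}{$cat} = $words->{$key};
    }
    return \%seen;
}

# Made-up sample data; with the real DB file you would pass \%words.
my %sample = (
    'spam-offer'  => 4,
    'ham-offer'   => 1,
    'spam-viagra' => 3,
    'ham-meeting' => 2,
);

my $seen = categories_per_word( \%sample );

# Number of distinct categories for each word.
my %n;
$n{$_} = keys %{ $seen->{$_} } for keys %$seen;

# Words in the most categories first, ties broken alphabetically.
for my $word ( sort { $n{$b} <=> $n{$a} || $a cmp $b } keys %n ) {
    print "$word\t$n{$word}\n";
}
```

    The whole index lives in memory here, which fits the assumption that the DB file is small enough for that.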

    If there's something in particular you want to do with the data that you can't figure out, give us a clearer idea of what that might be (and your first try at a solution).

      Data::Dumper works just fine on tied hashes.
      use strict;
      use warnings;
      use Data::Dumper;
      use DB_File;

      my %words;
      tie %words, 'DB_File', 'words.db';

      load() if $ARGV[0];
      print Dumper \%words;

      sub load {
          my $cnt = 1;
          foreach my $key ( 'John Cleese', 'Graham Chapman', 'Eric Idle', 'Terry Jones', 'Michael Palin') {
              $words{$key} = "Gumby $cnt";
              $cnt++;
          }
      }


      grep
      One dead unjugged rabbit fish later
        But... what if the DB file is really big? (Sometimes people tie a hash to a DB file because of the amount of data, not just for persistence.) When I tried the following test:
        perl -MDB_File -MData::Dumper -e 'tie %h, "DB_File", "junk.db";
          for ($i=0; $i<1_000_000; $i++) { $h{"key_$i"}="value_$i" };
          warn "the hash is loaded\n"; sleep 15;
          warn "starting dump\n"; print Dumper(\%h);
          warn "the hash has been dumped\n"; sleep 15' > /dev/null
        Memory usage stayed at about 27 MB (macosx/perl 5.8.6) while the hash was being built and throughout the first sleep, then climbed over 450 MB during the Dump phase. Data::Dumper was making its own internal copies of the keys and values.

        (The "junk.db" file itself was 47 MB, and a plain-text print out of keys and values as I suggested above would probably be about half that.)