hash function

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks.
I'm trying to learn Bayesian programming, and have found this code. It seems to work fine, but I'd like to be able to see what words it has associated with each category after it has run. Unfortunately the hash has me totally confused and I don't know how to print out the words in each category. Can anyone help me with a function to do this?


use strict;
use DB_File;

# Hash with composite key strings: $words{"category-word"} gives count of
# 'word' in 'category'.  Tied to a DB_File to keep it persistent.

my %words;
tie %words, 'DB_File', 'words.db';

# Read a file and return a hash of the word counts in that file

sub parse_file
{
    my ( $file ) = @_;
    my %word_counts;

    # Grab all the words with between 3 and 44 letters

    open FILE, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <FILE> ) {
        while ( $line =~ s/([[:alpha:]]{3,44})[ \t\n\r]// ) {
            $word_counts{lc($1)}++;
        }
    }
    close FILE;
    return %word_counts;
}

# Add words from a hash to the word counts for a category
sub add_words
{
    my ( $category, %words_in_file ) = @_;

    foreach my $word (keys %words_in_file) {
        $words{"$category-$word"} += $words_in_file{$word};
    }
}

# Get the classification of a file from word counts
sub classify
{
    my ( %words_in_file ) = @_;

    # Calculate the total number of words in each category and
    # the total number of words overall

    my %count;
    my $total = 0;
    foreach my $entry (keys %words) {
        $entry =~ /^(.+)-(.+)$/;
        $count{$1} += $words{$entry};
        $total += $words{$entry};
    }

    # Run through words and calculate the probability for each category

    my %score;
    foreach my $word (keys %words_in_file) {
        foreach my $category (keys %count) {
            if ( defined( $words{"$category-$word"} ) ) {
                $score{$category} += log( $words{"$category-$word"} /
                                          $count{$category} );
            } else {
                $score{$category} += log( 0.01 /
                                          $count{$category} );
            }
        }
    }
    # Add in the probability that the text is of a specific category

    foreach my $category (keys %count) {
        $score{$category} += log( $count{$category} / $total );
    }
    foreach my $category (sort { $score{$b} <=> $score{$a} } keys %count) {
        print "$category $score{$category}\n";
    }
}

# Supported commands are 'add' to add words to a category and
# 'classify' to get the classification of a file

if ( ( $ARGV[0] eq 'add' ) && ( $#ARGV == 2 ) ) {
    add_words( $ARGV[1], parse_file( $ARGV[2] ) );
} elsif ( ( $ARGV[0] eq 'classify' ) && ( $#ARGV == 1 ) ) {
    classify( parse_file( $ARGV[1] ) );
} else {
    print <<EOUSAGE;
Usage: add <category> <file> - Adds words from <file> to category <category>
       classify <file>       - Outputs classification of <file>
EOUSAGE
}

untie %words;
[download]


Re: hash function
by grep (Monsignor) on Oct 07, 2006 at 16:17 UTC

Data::Dumper

#top of file
use Data::Dumper;

# rest of code
print Dumper \%hash_i_want_to_look_at;
# this prints a nice representation of the data in the hash to STDOUT
[download]

I would give you a more specific solution, if you pare your code down to where you want to view the hash, instead of posting this large section of code.

Check out How (Not) To Ask A Question, and more specifically this section. Following those guidelines helps us give you a better answer.

grep

One dead unjugged rabbit fish later


Re^2: hash function

by Anonymous Monk on Oct 07, 2006 at 16:39 UTC

# show the words of a classification
sub show_classification
{
my ( $category ) = @_;
 #for each word in the matching category, print the word
}
[download]


Re^3: hash function

by grep (Monsignor) on Oct 07, 2006 at 17:14 UTC

Here is how I would have asked this question: go to the data. You want to look inside %words, so I would have posted the sub classify, which uses %words, and the tie that sets up %words. You're asking at a Perl site, so you can assume some of us understand Perl. What we don't know is your data.

To your question:
Take the keys of the %words and split them on '-'.

## UNTESTED
use Data::Dumper;
my %cats;
foreach my $cat_word (keys %words) {
  my ($cat,$word) = split(/-/,$cat_word);
  push(@{$cats{$cat}},$word);
}
print Dumper \%cats;
[download]

%cats

As for adding that into your program and printing it nicer than Data::Dumper, I'll leave that up to you.

grep

One dead unjugged rabbit fish later


Re: hash function
by BrowserUk (Patriarch) on Oct 07, 2006 at 17:13 UTC

As the category/word combinations are being stored in the hash combined into each key:

# from your add_words() sub
...
   $words{"$category-$word"} += $words_in_file{$word};
[download]

You cannot make use of the main feature of hashes (O(1) lookup) to find the words of a category. Instead you will have to iterate over the keys of the hash, break them into their components, and then print the word if the category matches.

sub show_classification{
    my ( $wantedCategory ) = @_;

   #for each word in the matching category, print the word
    for my $key ( keys %words ) {
        my( $category, $word ) = split '-', $key;
        print "$word\n" if $category eq $wantedCategory;
    }
}
[download]
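If you look up categories often, one option is to regroup the composite keys once into a nested in-memory hash, after which each category is a direct lookup. This is only a rough sketch: %by_category is a name I made up, and the plain %words below is a stand-in with invented entries so the snippet runs without words.db. With the real data you would keep the tie from the original code.

```perl
use strict;
use warnings;

# Stand-in for the tied %words hash; keys and counts are invented.
my %words = (
    'spam-viagra' => 4,
    'spam-offer'  => 2,
    'ham-meeting' => 3,
);

# Regroup "category-word" keys into $by_category{category}{word} = count.
my %by_category;
while ( my ( $key, $count ) = each %words ) {
    # Same greedy pattern classify() uses, so the break is at the last '-'
    my ( $category, $word ) = $key =~ /^(.+)-(.+)$/ or next;
    $by_category{$category}{$word} = $count;
}

# Print (and return) "word<TAB>count" lines for one category
sub show_classification {
    my ( $category ) = @_;
    my $words_for = $by_category{$category} or return;
    my @lines = map { "$_\t$words_for->{$_}" } sort keys %$words_for;
    print "$_\n" for @lines;
    return @lines;
}

show_classification('spam');
```

The trade-off is that %by_category is a snapshot: anything add_words() puts into %words afterwards will not show up until you rebuild it.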

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: hash function

by Anonymous Monk on Oct 07, 2006 at 17:23 UTC

Thanks BrowserUK! I don't fully understand it, but I'll sit down with my perl book now and figure out what you've done. cheers!


Re: hash function
by graff (Chancellor) on Oct 07, 2006 at 17:17 UTC

~~I don't think Data::Dumper would help in terms of inspecting the contents of the DB file.~~ (thanks for the correction, grep) If you would like a plain-text (flat-table) dump of the DB file contents, something like this would do:

use strict;
use DB_File;

# Hash with composite key strings: $words{category-word} gives count of
# 'word' in 'category'.  Tied to a DB_File to keep it persistent.

my %words;
tie %words, 'DB_File', 'words.db';

open( TXT, '>', 'words.txt' ) or die "Cannot open words.txt: $!";

while ( my ($key, $value) = each %words ) {
    print TXT "$key\t$value\n";
}
close TXT;
untie %words;
[download]

Loading the DB file contents into an in-memory data structure shouldn't be a problem for this kind of app, in case you want to do things like sorting, or working out how many categories are associated with each word, etc.
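As a rough illustration of that, here is a sketch that copies the hash into memory, sorts the entries by count, and works out how many categories each word appears in. The names %in_memory and %categories_for are made up for this example, and the plain %words is a stand-in with invented entries; with the real data you would tie it to words.db as above.

```perl
use strict;
use warnings;

# Stand-in for the tied %words hash; these entries are invented.
my %words = (
    'spam-offer' => 5,
    'spam-free'  => 9,
    'ham-offer'  => 1,
    'ham-lunch'  => 4,
);

# Copy into an ordinary hash so nothing below goes back to the DB file
my %in_memory = %words;

# Keys sorted by descending count, ties broken alphabetically
my @by_count = sort { $in_memory{$b} <=> $in_memory{$a} or $a cmp $b }
               keys %in_memory;
print "$_\t$in_memory{$_}\n" for @by_count;

# Number of categories each word appears in. Every key is a unique
# category-word pair, so one increment per key is enough.
my %categories_for;
for my $key ( keys %in_memory ) {
    my ( $category, $word ) = $key =~ /^(.+)-(.+)$/ or next;
    $categories_for{$word}++;
}
print "$_: $categories_for{$_}\n" for sort keys %categories_for;
```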

If there's something in particular you want to do with the data that you can't figure out, give us a clearer idea of what that might be (and your first try at a solution).


Re^2: hash function

by grep (Monsignor) on Oct 07, 2006 at 18:32 UTC

Data::Dumper

Tied

use strict;
use warnings;
use Data::Dumper;

use DB_File;

my %words;
tie %words, 'DB_File', 'words.db';

load() if $ARGV[0];

print Dumper \%words;

sub load {
  my $cnt = 1;
  foreach my $key ( 'John Cleese', 'Graham Chapman', 'Eric Idle', 'Terry Jones', 'Michael Palin') {
    $words{$key} = "Gumby $cnt";
    $cnt++;
  }
}
[download]

grep

One dead unjugged rabbit fish later


Re^3: hash function

by graff (Chancellor) on Oct 07, 2006 at 19:33 UTC

really

perl -MDB_File -MData::Dumper -e 'tie %h, "DB_File", "junk.db"; 
for ($i=0; $i<1_000_000; $i++) { $h{"key_$i"}="value_$i" };
warn "the hash is loaded\n"; sleep 15; warn "starting dump\n"; 
print Dumper(\%h); warn "the hash has been dumped\n"; sleep 15' > /dev/null
[download]

(The "junk.db" file itself was 47 MB, and a plain-text print out of keys and values as I suggested above would probably be about half that.)


Re^4: hash function

by chromatic (Archbishop) on Oct 07, 2006 at 19:50 UTC