using hash to find frequency count

jjohhn has asked for the wisdom of the Perl Monks concerning the following question:

I am a reading a text file (Alice in Wonderland), parsing each line in a primitive way, trying to print the most common "word" using a hash.
I could do it by iterating through the keys, but instead I am trying to keep track each time I catch a word.
My program prints as shown below.
("ALICE'S" is the first word in the file; I would expect the screen to be full of the rest of the words too.)

C:\scripts>wordcount.pl alice.txt
ALICE'Sdistinct words: 4007
frequency of most common word: 5160
common word:

use strict;

my $maxcount;
my $find;

sub read_line {
  our %hash;

  my @list = split /[,.?!)"]?\s\)?/, shift;
  my $count;
  foreach my $word (@list) {
    $hash{lc $word}++;
    $count =$hash{lc $word};
    if ($count > $maxcount) {
      print "$word";
      $maxcount++;
      $find = $word;
    }
  }
}

sub read_file{
  my $file=shift;
  open (FILE, $file) or die "couldn't open $file: $!";
  while (my $line = <FILE>) {
    read_line $line;
  }
}
read_file @ARGV;

my $numwords= keys our %hash;
print "distinct words: $numwords\n";
print "frequency of most common word: $maxcount\n";
print "common word: $find";
[download]

Comment on using hash to find frequency count Download Code

Replies are listed 'Best First'.
Re: using hash to find frequency count by ikegami (Patriarch) on May 15, 2005 at 03:30 UTC
The problem lies with your `split`. `@list` contains things which aren't words, including 0-length strings. Perhaps you should replace `my @list = split /[,.?!)"]?\s\)?/, shift;` with something like `my @list = shift =~ /([a-zA-Z'\-]+)/g;` `$maxcount++;` should be `$maxcount = $count;` `readline` is poorly named. It doesn't read anything. `$hash{lc $word}++;` `$count = $hash{lc $word};` can be simplified to `$count = ++$hash{lc $word};` `sub read_file{ ... }` `read_file @ARGV;` can be simplified to `while (<>) { read_line $_; }`	[reply] [d/l] [select]
Re^2: using hash to find frequency count by jjohhn (Scribe) on May 15, 2005 at 03:57 UTC
Thanks, ikegami. Your comments made the code much more readable. I don't fully understand what the matching is doing though; it does not appear to be splitting the line on delimiters, and I am unclear on the fuction of the parens. I made the repairs as you suggest; my new code is still lacking something C:\scripts>wordcount.pl alice.txt distinct words: 0 frequency of most common word: common word: `use strict; my $maxcount; my $find; my $file; my %hash; my $count; while(<>){ my @list = shift =~ /([a-xA-Z'\-]+)/g; foreach my $word (@list) { $count =++$hash{lc $word}; if ($count > $maxcount) { $maxcount = $count; } } } my $numwords= keys %hash; print "distinct words: $numwords\n"; print "frequency of most common word: $maxcount\n"; print "common word: $find";` [download]	[reply] [d/l]
Re^3: using hash to find frequency count by jjohhn (Scribe) on May 15, 2005 at 04:07 UTC
Changing `while(<>){ my @list = shift =~ /([a-xA-Z'\-]+)/g;` [download] to `while(<>){ my @list = $_ =~ /([a-xA-Z'\-]+)/g;` [download] gave me results now; could you explain a little what the match is doing to parse the lines? C:\scripts>wordcount.pl alice.txt distinct words: 2815 frequency of most common word: 1779 common word:	[reply] [d/l] [select]
Re^4: using hash to find frequency count by jjohhn (Scribe) on May 15, 2005 at 04:23 UTC
Re^4: using hash to find frequency count by ikegami (Patriarch) on May 15, 2005 at 13:44 UTC
Re: using hash to find frequency count by jpeg (Chaplain) on May 15, 2005 at 03:32 UTC
A few things: Doesn't the code instantiate a new %hash and $count every time read_line is called? since you're resetting $count for every line, $count will never > $maxcount and $find is never set, as you see in the output. It's a good idea to separate the data collection steps from the analysis steps. Move that if block into its own sub. -- jpg	[reply]
Re^2: using hash to find frequency count by ikegami (Patriarch) on May 15, 2005 at 03:37 UTC
Doesn't the code instantiate a new %hash and $count every time read_line is called? No. `our` makes it a global, so the function always uses the same variable, `%main::hash`. Since `$count` comes from a constantly updated global, it grows bigger than `$maxcount`. If `$count` never outgrew `$maxcount`, `$maxcount` would never be higher than 1, yet it's 5160 for the OP's test run. since you're resetting $count for every line, $count will never > $maxcount and $find is never set, as you see in the output. Actually, you can see that `$maxcount` IS set in the output. `$find` is also set, but it's not so obvious because it happens to be set to the 0-length string. It's a good idea to separate the data collection steps from the analysis steps. Move that if block into its own sub. Actually, he already did that. `readfile` reads the lines, and the misnamed `readline` processes (analyses) them.	[reply] [d/l] [select]
Re: using hash to find frequency count by ghenry (Vicar) on May 15, 2005 at 11:39 UTC
Maybe you could get some ideas from my Wordcounting with DBM files - can't open file? node. Maybe not ;-) Walking the road to enlightenment... I found a penguin and a camel on the way..... Fancy a yourname@perl.me.uk? Just ask!!!	[reply]
Re: using hash to find frequency count by tlm (Prior) on May 15, 2005 at 14:54 UTC
Aside from the question of how suitable the items in `@list` are for your purposes, the logic in `read_line`'s loop doesn't make sense to me. I think that what you want is: `foreach my $word (@list) { print $word, $/ unless $hash{ lc $word }++; $count = $hash{ lc $word }; if ($count > $maxcount) { $maxcount = $count; $find = $word; } }` [download] the lowliest monk	[reply] [d/l] [select]