in reply to using hash to find frequency count

The problem lies with your split. @list contains things which aren't words, including 0-length strings.

Perhaps you should replace
my @list = split /[,.?!)"]?\s\)?/, shift;
with something like
my @list = shift =~ /([a-zA-Z'\-]+)/g;
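
To see the difference, here is a minimal sketch that runs both expressions on a made-up sample line. The sample text and the names @by_split and @by_match are mine, not from the original script.

```perl
use strict;
use warnings;

# Made-up sample line: note the double space and the parentheses.
my $line = 'Alice  was (very) tired.';

# Original approach: split on optional punctuation, one whitespace
# character, and an optional closing paren.
my @by_split = split /[,.?!)"]?\s\)?/, $line;

# Suggested approach: collect every run of letters, apostrophes and hyphens.
my @by_match = $line =~ /([a-zA-Z'\-]+)/g;

# Bracket each element so that empty strings are visible.
print "split: ", join(' ', map { "[$_]" } @by_split), "\n";
print "match: ", join(' ', map { "[$_]" } @by_match), "\n";
```

The split version should show an empty element where the two spaces meet, and it leaves punctuation such as the opening paren and the final period attached to the words. The match version returns only the words.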

$maxcount++;
should be
$maxcount = $count;

readline is poorly named. It doesn't read anything.

$hash{lc $word}++;
$count = $hash{lc $word};
can be simplified to
$count = ++$hash{lc $word};
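
This works because the prefix form of ++ returns the value after incrementing. A small sketch, with a made-up word list:

```perl
use strict;
use warnings;

my %hash;
my @counts;

# Made-up input: three spellings of the same word, then a different word.
for my $word (qw(The the THE cat)) {
    # Pre-increment bumps the stored count and returns the new value.
    my $count = ++$hash{lc $word};
    push @counts, $count;
}

print "@counts\n";    # prints "1 2 3 1"
```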

sub read_file{ ... }
read_file @ARGV;
can be simplified to
while (<>) { read_line $_; }
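
For illustration, a self-contained sketch of that loop. The body of read_line is a stand-in I wrote (the original sub is not shown here), and the temporary file takes the place of a file named on the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my %hash;

# Hypothetical stand-in for the per-line routine: count each word,
# ignoring case.
sub read_line {
    my ($line) = @_;
    $hash{lc $_}++ for $line =~ /([a-zA-Z'\-]+)/g;
}

# Write a small sample file and pretend it was given on the command line.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "The cat\n", "the dog\n";
close $fh;
@ARGV = ($path);

# <> opens each file named in @ARGV in turn and returns one line at a time.
while (<>) {
    read_line($_);
}

print "$_: $hash{$_}\n" for sort keys %hash;
```

With no file names on the command line, the same loop reads standard input instead.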

Re^2: using hash to find frequency count
by jjohhn (Scribe) on May 15, 2005 at 03:57 UTC
    Thanks, ikegami. Your comments made the code much more readable. I don't fully understand what the matching is doing though; it does not appear to be splitting the line on delimiters, and I am unclear on the function of the parens.
    I made the repairs as you suggest; my new code is still lacking something
    C:\scripts>wordcount.pl alice.txt
    distinct words: 0
    frequency of most common word:
    common word:
    use strict;

    my $maxcount;
    my $find;
    my $file;
    my %hash;
    my $count;

    while(<>){
        my @list = shift =~ /([a-xA-Z'\-]+)/g;
        foreach my $word (@list) {
            $count =++$hash{lc $word};
            if ($count > $maxcount) {
                $maxcount = $count;
            }
        }
    }

    my $numwords= keys %hash;
    print "distinct words: $numwords\n";
    print "frequency of most common word: $maxcount\n";
    print "common word: $find";
      Changing
      while(<>){ my @list = shift =~ /([a-xA-Z'\-]+)/g;
      to
      while(<>){ my @list = $_ =~ /([a-xA-Z'\-]+)/g;
      gave me results now; could you explain a little what the match is doing to parse the lines?
      C:\scripts>wordcount.pl alice.txt
      distinct words: 2815
      frequency of most common word: 1779
      common word:
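
A likely reason the shift version found nothing: outside a subroutine, shift with no argument works on @ARGV, and by the time the loop body runs, <> has already taken the file name off @ARGV. So shift returns undef and the match finds no words. A minimal sketch of that behaviour, using a temporary file as a stand-in for alice.txt (the file contents and variable names are made up):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file standing in for the text file named on the command line.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "Alice was tired\n";
close $fh;
@ARGV = ($path);

my ($left_in_argv, $shifted, $line);
while (<>) {
    $left_in_argv = scalar @ARGV;  # <> already removed the file name
    $shifted      = shift;         # same as shift @ARGV here
    $line         = $_;            # the text actually read from the file
}

print "items left in \@ARGV inside the loop: $left_in_argv\n";
print "shift returned: ", (defined $shifted ? $shifted : 'undef'), "\n";
print "\$_ held: $line";
```

With several file names on the command line, shift inside the loop would instead remove the next file name, so that file would never be read.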
        The answer, not surprisingly, is "the".
        The match was the key, as ikegami said.
        I would greatly appreciate a hint about how it is producing an appropriate list of words to count. This improved approach does not appear to parse a string on delimiters, but to alter the value of $_.
        could you explain a little what the match is doing to parse the lines?

        It says: Match a "word", defined as a sequence of one or more letters, hyphens and apostrophes ([...]+). When you find that, return it (()). Repeat (/g). That definition of a word is rather primitive, and may need to be tweaked.
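
A short sketch of those three pieces in action, on a made-up line. In list context the match returns every captured word at once; in a while loop it hands them over one at a time in $1. The names @all and @one_by_one are just for this example.

```perl
use strict;
use warnings;

# Made-up sample line, including a bare "--" and a number.
my $line = "Alice's rabbit-hole -- it's 42 deep";

# List context: /g returns everything the parens captured, all at once.
my @all = $line =~ /([a-zA-Z'\-]+)/g;

# Scalar context: each pass of the loop finds the next match,
# and the parens leave it in $1.
my @one_by_one;
while ($line =~ /([a-zA-Z'\-]+)/g) {
    push @one_by_one, $1;
}

print join(' ', map { "[$_]" } @all), "\n";
print join(' ', map { "[$_]" } @one_by_one), "\n";
```

Both forms should give the same list. Note that the bare "--" counts as a word under this definition while "42" is skipped, which is the kind of thing that may need tweaking.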

        use strict;

        my $maxcount;
        my $find;
        my $file;
        my %hash;
        my $count;

        while (<>) {
            while (/([a-zA-Z'\-]+)/g) {
                my $word = $1;
                $count = ++$hash{lc $word};
                if ($count > $maxcount) {
                    $find = $word;
                    $maxcount = $count;
                }
            }
        }

        my $numwords = keys %hash;
        print "distinct words: $numwords\n";
        print "frequency of most common word: $maxcount\n";
        print "common word: $find";

        __END__
        output of perl script.pl script.pl
        ==================================
        distinct words: 25
        frequency of most common word: 7
        common word: my