jjohhn has asked for the wisdom of the Perl Monks concerning the following question:

I am a reading a text file (Alice in Wonderland), parsing each line in a primitive way, trying to print the most common "word" using a hash.
I could do it by iterating through the keys, but instead I am trying to keep track each time I catch a word.
My program prints as shown below.
("ALICE'S" is the first word in the file; I would expect the screen to be full of the rest of the words too.)

C:\scripts>wordcount.pl alice.txt
ALICE'Sdistinct words: 4007
frequency of most common word: 5160
common word:

use strict; my $maxcount; my $find; sub read_line { our %hash; my @list = split /[,.?!)"]?\s\)?/, shift; my $count; foreach my $word (@list) { $hash{lc $word}++; $count =$hash{lc $word}; if ($count > $maxcount) { print "$word"; $maxcount++; $find = $word; } } } sub read_file{ my $file=shift; open (FILE, $file) or die "couldn't open $file: $!"; while (my $line = <FILE>) { read_line $line; } } read_file @ARGV; my $numwords= keys our %hash; print "distinct words: $numwords\n"; print "frequency of most common word: $maxcount\n"; print "common word: $find";

Replies are listed 'Best First'.
Re: using hash to find frequency count
by ikegami (Patriarch) on May 15, 2005 at 03:30 UTC

    The problem lies with your split. @list contains things which aren't words, including 0-length strings.

    Perhaps you should replace
    my @list = split /[,.?!)"]?\s\)?/, shift;
    with something like
    my @list = shift =~ /([a-zA-Z'\-]+)/g;

    $maxcount++;
    should be
    $maxcount = $count;

    readline is poorly named. It doesn't read anything.

    $hash{lc $word}++;
    $count = $hash{lc $word};
    can be simplified to
    $count = ++$hash{lc $word};

    sub read_file{ ... }
    read_file @ARGV;
    can be simplified to
    while (<>) { read_line $_; }

      Thanks, ikegami. Your comments made the code much more readable. I don't fully understand what the matching is doing though; it does not appear to be splitting the line on delimiters, and I am unclear on the fuction of the parens.
      I made the repairs as you suggest; my new code is still lacking something
      C:\scripts>wordcount.pl alice.txt
      distinct words: 0
      frequency of most common word:
      common word:
      use strict; my $maxcount; my $find; my $file; my %hash; my $count; while(<>){ my @list = shift =~ /([a-xA-Z'\-]+)/g; foreach my $word (@list) { $count =++$hash{lc $word}; if ($count > $maxcount) { $maxcount = $count; } } } my $numwords= keys %hash; print "distinct words: $numwords\n"; print "frequency of most common word: $maxcount\n"; print "common word: $find";
        Changing
        while(<>){ my @list = shift =~ /([a-xA-Z'\-]+)/g;
        to
        while(<>){ my @list = $_ =~ /([a-xA-Z'\-]+)/g;
        gave me results now; could you explain a little what the match is doing to parse the lines?
        C:\scripts>wordcount.pl alice.txt
        distinct words: 2815
        frequency of most common word: 1779
        common word:
Re: using hash to find frequency count
by jpeg (Chaplain) on May 15, 2005 at 03:32 UTC
    A few things:
    • Doesn't the code instantiate a new %hash and $count every time read_line is called?
    • since you're resetting $count for every line, $count will never > $maxcount and $find is never set, as you see in the output.
    • It's a good idea to separate the data collection steps from the analysis steps. Move that if block into its own sub.
    --
    jpg
      Doesn't the code instantiate a new %hash and $count every time read_line is called?

      No. our makes it a global, so the function always uses the same variable, %main::hash.

      Since $count comes from a constantly updated global, it grows bigger than $maxcount. If $count never outgrew $maxcount, $maxcount would never be higher than 1, yet it's 5160 for the OP's test run.

      since you're resetting $count for every line, $count will never > $maxcount and $find is never set, as you see in the output.

      Actually, you can see that $maxcount IS set in the output. $find is also set, but it's not so obvious because it happens to be set to the 0-length string.

      It's a good idea to separate the data collection steps from the analysis steps. Move that if block into its own sub.

      Actually, he already did that. readfile reads the lines, and the misnamed readline processes (analyses) them.

Re: using hash to find frequency count
by ghenry (Vicar) on May 15, 2005 at 11:39 UTC

    Maybe you could get some ideas from my Wordcounting with DBM files - can't open file? node.

    Maybe not ;-)

    Walking the road to enlightenment... I found a penguin and a camel on the way.....
    Fancy a yourname@perl.me.uk? Just ask!!!
Re: using hash to find frequency count
by tlm (Prior) on May 15, 2005 at 14:54 UTC

    Aside from the question of how suitable the items in @list are for your purposes, the logic in read_line's loop doesn't make sense to me. I think that what you want is:

    foreach my $word (@list) { print $word, $/ unless $hash{ lc $word }++; $count = $hash{ lc $word }; if ($count > $maxcount) { $maxcount = $count; $find = $word; } }

    the lowliest monk