Adetque has asked for the wisdom of the Perl Monks concerning the following question:

I'm fairly new to Perl, so this shouldn't be very hard to answer.

Anyway, I'm working on a function that returns a hash of how often each word occurs in a text file. I use a regex of characters that can't be part of a word (such as whitespace and parentheses) to decide where one word ends and the next starts. However, when a line contains nothing but a newline, the newline seems to be added to the hash as if it were a word. I have no idea how to fix this, or whether the problem is with read or with my regex. Here's my code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub readWords {
        ## Gets how many of each word are in a file and returns a hash
        my $file = shift;
        my %words = ();
        my $currentWord = "";
        # What characters to ignore
        my $blacklist = '[\s~`!@#\$%\^&\*\(\)\{\}\+=\\\/\[\]\.\,<>\?;:"]';
        open(my $FILE, "<", $file) or die("$0: $file: $!\n");
        while(!eof($FILE)) {
            while(read($FILE, my $letter, 1)) {
                if($letter !~ /$blacklist/) {
                    $currentWord .= lc($letter);
                } else {
                    last;
                }
            }
            if(!defined($words{$currentWord})) {
                $words{$currentWord} = 0;
            }
            $words{$currentWord}++;
            $currentWord = "";
        }
        close($FILE);
        return %words;
    }

    sub main() {
        my %words = readWords($ARGV[0]);
        my @keys = keys(%words);
        my @commonWord = ("", 0);
        foreach my $key (sort(@keys)) {
            if($words{$key} > $commonWord[1]) {
                @commonWord = ("$key", $words{$key});
            }
            print("$key: $words{$key}\n");
        }
        print("Number of unique words: " . scalar(@keys) . "\n");
        print("Most common word: $commonWord[0] - used $commonWord[1] times\n");
    }

    main();

Thanks in advance if you decide to help.

Re: The read function and newlines
by roboticus (Chancellor) on Jul 03, 2010 at 19:57 UTC

    Adetque:

    I don't see a way for a newline to be in your list. However, it looks like you'll get an empty string in your hash: You read the file with a newline, the newline is ignored, then you hit the end of the file. So $currentWord is "", which doesn't exist in your hash, so it's added. You'll probably want to verify that $currentWord isn't empty before stuffing it into your hash.
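A minimal sketch of that emptiness check, not the poster's exact code: the helper name count_words_by_char is hypothetical, the separator class is a simplified stand-in for the original blacklist, and an in-memory filehandle replaces the file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: counts words read one character at a time and
# skips the empty "word" that appears when two separators are adjacent.
sub count_words_by_char {
    my ($fh) = @_;
    my %words;
    my $current = '';
    # Assumption: anything other than a letter, digit, underscore or
    # apostrophe ends a word (simplified from the original blacklist).
    my $separator = qr{[^a-z0-9_']}i;
    while (read($fh, my $letter, 1)) {
        if ($letter !~ $separator) {
            $current .= lc($letter);
            next;
        }
        $words{$current}++ if length $current;   # the guard
        $current = '';
    }
    # Count a final word when the input does not end with a separator.
    $words{$current}++ if length $current;
    return %words;
}

# In-memory filehandle so the sketch needs no file on disk.
my $text = "Hello world\n\nhello (again)\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my %count = count_words_by_char($fh);
close($fh);
print "$_: $count{$_}\n" for sort keys %count;
```

With the guard in place, the blank line and the opening parenthesis no longer produce an empty-string key.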

    Having said that, though, I think I'd just use split to get your list of words and enter them into the hash--something like this (untested):

        sub readWords {
            ## Gets how many of each word are in a file and returns a hash
            my $file = shift;
            my %words = ();
            # What characters to ignore
            my $blacklist = qr{[\s~`!@#\$%\^&\*\(\)\{\}\+=\\\/\[\]\.\,<>\?;:"]+};
            open(my $FILE, "<", $file) or die("$0: $file: $!\n");
            while(my $currentline = <$FILE>) {
                $words{$_}++ for split $blacklist, $currentline;
            }
            close($FILE);
            return %words;
        }

    ...roboticus

    Update: I just tested the function and it works. A couple observations, though:

    • You're returning a hash, but you may want to consider returning a hash reference instead.
    • Your blacklist still allows some non-word characters in it (e.g. ' and _). You might want to use:
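         my $blacklist = qr{\W+};
      (Note that \W treats _ as a word character, so underscores would still need to be handled separately.)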
    • The function allows an empty string to be added, so you might want to add (before the return statement):
         delete $words{''};
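The three observations above can be combined into one sketch. This is an illustration under assumptions rather than the tested function from the post: count_words is a hypothetical name, it takes an already-open filehandle, and it lowercases words as the original question's code did.

```perl
use strict;
use warnings;

# Hypothetical variant of readWords: takes an open filehandle and
# returns a hash reference instead of a flattened hash.
sub count_words {
    my ($fh) = @_;
    my %words;
    while (my $line = <$fh>) {
        # split yields an empty leading field when the line starts
        # with a separator, e.g. "(Hello)" or an indented line.
        $words{lc($_)}++ for split /\W+/, $line;
    }
    delete $words{''};   # drop the empty-string key, if it was created
    return \%words;
}

# In-memory filehandle so the sketch needs no file on disk.
my $text = "(Hello) world\n\n  hello again\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my $words = count_words($fh);
close($fh);
printf "%s: %d\n", $_, $words->{$_} for sort keys %$words;
```

Returning a reference avoids copying the whole hash on return; the caller dereferences with $words->{...} or %$words.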

      Thanks. Both the split function and deleting the key work.

      Also, I purposely didn't remove apostrophes since they could be part of a word, and I just forgot about underscores.

        Also, I purposely didn't remove apostrophes since they could be part of a word, and I just forgot about underscores.

        Heh ... I hadn't even considered contractions. I'm glad it's working for you!

        ...roboticus

Re: The read function and newlines
by chromatic (Archbishop) on Jul 04, 2010 at 03:04 UTC

      Which left you with lots of nonsensical words like 's', 't', 'm', 're', 've', 'll', & 'isn', 'don', 'doesn', 'aren', 'wouldn', 'shouldn', 'couldn'...

      Much the same as if you had used split '\W+', $scalar. Because, after loading jiggabytes of stuff you're not going to use, that's exactly what you did. Just more slowly, laboriously and obscurely. And without an easy option for correcting for the above limitations.
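One possible way to keep contractions whole, offered as a sketch and not as what either poster used: match the words themselves, allowing apostrophes between word characters, instead of splitting on separators. The helper name extract_words is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns lowercase words, keeping apostrophes
# that sit between word characters (don't, isn't, they're).
sub extract_words {
    my ($text) = @_;
    return map { lc($_) } $text =~ /\w+(?:'\w+)*/g;
}

my $sentence = "Don't panic, it isn't broken";
my @naive = split /\W+/, $sentence;      # breaks contractions apart
my @kept  = extract_words($sentence);    # keeps them together
print "split on \\W+: @naive\n";
print "match words:  @kept\n";
```

Splitting on \W+ produces fragments such as 'Don' and 't', while the matching approach returns "don't" and "isn't" as single words. Leading or trailing apostrophes (quotes around a word) are not included in the match.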


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        And without an easy option for correcting for the above limitations.

        See the set_non_word_regexp() method.

        Because, after loading jiggabytes of stuff you're not going to use....

        Here's a tuppence; go buy yourself a second megabyte of RAM.