derpp has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I've been posting tons of questions, and thanks for all the replies. I'm such a nublet at this. But anyway, here's my problem: I can successfully get my program to count the number of occurrences, but for some reason, certain words, like 'The' and other commonly used words, are reappearing even though they had been counted already. My only guess is that Perl's reading it in paragraphs?

The website I copied my words from is http://money.cnn.com/2010/08/10/news/companies/walt_disney_earnings/index.htm (the third, fourth, fifth, and sixth paragraphs). Just put it in a word document and open it up.

Sorry. I know this is really troublesome.
use warnings;

open (FILE, '<insertfilepath>' || $!);
undef($/);
while (<FILE>) {
    @array = split(/\ /, $_);
    my $word;
    foreach (@array) {
        print "$_\n";
    }
}
for (@array){
    s/[\,|\.|\!|\?|\:|\;|\"|\'|\<|\>]//g;
    $word{$_}++;
}
for (sort(keys %word)) {
    print "$_ occurred $word{$_} times\n";
}

Replies are listed 'Best First'.
Re: counting number of occurrences of words in a file
by ikegami (Patriarch) on Aug 11, 2010 at 18:26 UTC

    "The" ne "the". You should normalise the case.

    Other issues:

    • You have use warnings; in place. Great! You should also add use strict;, though.

    • Why loop when you know you're only going to get one value?

      undef($/); while (<FILE>) { }
      should be
      undef($/); $_ = <FILE>;
    • Use of alternation inside character class.

      s/[\,|\.|\!|\?|\:|\;|\"|\'|\<|\>]//g;
      is the same as
      s/[\,\.\!\?\:\;\"\'\<\>||||||||]//g;
      and
      s/[\,\.\!\?\:\;\"\'\<\>|]//g;
      You want an alternation
      s/\,|\.|\!|\?|\:|\;|\"|\'|\<|\>//g;
      or a character class
      s/[\,\.\!\?\:\;\"\'\<\>]//g;
    • Useless escaping. For readability,

      @array = split(/\ /, $_);
      s/[\,\.\!\?\:\;\"\'\<\>]//g;
      should be
      @array = split(/ /, $_);
      s/[,.!?:;"'<>]//g;
    • '<insertfilepath>' || $!
      is the same as just
      '<insertfilepath>'

      since the file name will always be a true value.

    • Useless use of global variables (FILE), and unlocalised changes to global variables ($/).

    • Splitting on spaces won't split on newlines, and will produce empty strings when there are two spaces in a row. Split on special ' ' instead.

      @array = split(/ /, $_);
      should be
      @array = split(' ', $_);
    use strict;
    use warnings;

    my $file;
    {
        my $qfn = '<insertfilepath>';
        open(my $FILE, '<', $qfn)
            or die("Can't open \"$qfn\": $!\n");
        local $/;
        $file = <$FILE>;
    }

    my %word_counts;
    for (split(' ', $file)) {
        s/[,.!?:;"'<>]//g;
        ++$word_counts{lc($_)};
    }

    for my $word (sort keys(%word_counts)) {
        print "$word occurred $word_counts{$word} times\n";
    }
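To see the split point from the list above in isolation, here is a minimal sketch. The sample string is made up for illustration and is not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample: two double spaces and one newline between words.
my $text = "the  cat  sat\nthe mat";

# A literal single space as the pattern: double spaces yield empty
# fields, and the newline is not treated as a separator.
my @by_space = split(/ /, $text);

# The special ' ' pattern: splits on any run of whitespace and skips
# leading whitespace.
my @by_special = split(' ', $text);

my $empties = grep { $_ eq '' } @by_space;

printf "split(/ /) gave %d fields, %d of them empty\n",
    scalar(@by_space), $empties;
printf "split(' ') gave %d fields: %s\n",
    scalar(@by_special), join('|', @by_special);
```

With the pattern / /, one of the fields still contains the embedded newline, which is why the same word can show up under two different hash keys.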

    Update: There's no reason to load the entire file into memory at once, and if you don't, you gain the ability to pass a file name on the command line.

    use strict;
    use warnings;

    my %word_counts;
    while (<>) {
        for (split(' ', $_)) {
            s/[,.!?:;"'<>]//g;
            ++$word_counts{lc($_)};
        }
    }

    for my $word (sort keys(%word_counts)) {
        print "$word occurred $word_counts{$word} times\n";
    }
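The character-class point from the list above can be checked with a short sketch. The sample string and the shortened punctuation list are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample containing a literal pipe character.
my $sample = 'a|b, "c".';

# Inside [...], | is an ordinary character, so it gets deleted too.
(my $with_pipes = $sample) =~ s/[\,|\.|\"]//g;

# A plain character class: only the listed punctuation is removed.
(my $class_only = $sample) =~ s/[,."]//g;

# A real alternation, outside any brackets: same effect as the class.
(my $alternation = $sample) =~ s/,|\.|"//g;

print "class with pipes: $with_pipes\n";
print "class only:       $class_only\n";
print "alternation:      $alternation\n";
```

Only the first form strips the | from the input. The other two leave it alone and agree with each other.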
Re: counting number of occurrences of words in a file
by kennethk (Abbot) on Aug 11, 2010 at 18:32 UTC
    A couple of critiques of your posted code:

    1. You use warnings but not strict; is there a reason?
    2. Your open test doesn't do what you think. The C-style logical or (||) has higher precedence than the comma operator, so as long as your file path is not logically false, the || $! part is a no-op. In addition, it sits inside the parentheses of the open call, so it could never act on open's return value. The smallest change that will yield code that functions as you likely expect is open (FILE, '<insertfilepath>') || die $!; though I personally would use something closer to
      open (my $fh, '<', '<insertfilepath>') or die "Open failed : $!";
      undef($/);
      while (<$fh>) {
      See perlopentut.
    3. The default behavior for split with no arguments will do what you intend: it splits $_ on one or more consecutive whitespace characters. Your expression likely does not do what you intend for Hello.  How are you? since it creates an empty entry for the double space after the period. I'd swap the line to:
      my @array = split;
      or at least
      my @array = split(/\s+/,$_);
    4. You never use a scalar named $word but you declare one - another no-op. You likely mean my %word;. See Perl variable types in perlintro.
    5. Rather than try and define every possible non-word character, you should use character classes. So replace s/[\,|\.|\!|\?|\:|\;|\"|\'|\<|\>]//g; with s/\W//g. This is not literally identical, but if you are just using English language sources w/o mathematical formulas you are pretty well safe. See perlretut.
    6. You don't account for variations in capitalization - I suspect this is the bug you are encountering. You should lowercase the result to compensate, either with $_ = lc; or tr/A-Z/a-z/;
    7. You also have a scoping issue with overwriting @array that you avoided through luck because you slurp the file and don't enforce strict.
    With all these changes, your code might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open (my $fh, '<', '<insertfilepath>') or die "Open failed : $!";
    undef($/);
    my %word;
    while (<$fh>) {
        my @array = split(/\s+/, $_);
        foreach (@array) {
            print "$_\n";
        }
        for (@array){
            s/\W//g;
            tr/A-Z/a-z/;
            $word{$_}++;
        }
    }
    for (sort(keys %word)) {
        print "$_ occurred $word{$_} times\n";
    }
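The precedence point in item 2 can be demonstrated with a small sketch. The path below is hypothetical and assumed not to exist. The sketch uses three-argument open, whereas the original post used the two-argument form.

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist on the machine running this.
my $path = '/no/such/dir/no_such_file.txt';

# || binds to the file name alone; a non-empty name is true, so the
# right-hand side is never evaluated.
my $arg = $path || die "never reached";

# Same shape as the original post: open fails, but nothing checks its
# return value, so the script carries on.
my $ok = open(my $fh, '<', $path || die "never reached");
print "unchecked open returned ", ($ok ? "true" : "false"),
    ", script still running\n";

# Checked form: "or" has lower precedence than the whole open(...) call.
# eval is used here only so the sketch can report the failure.
my $died = !eval {
    open(my $fh2, '<', $path) or die "Open failed: $!\n";
    1;
};
print "checked open died: ", ($died ? "yes" : "no"), "\n";
```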
      Thank you! Using some of your suggestions and the suggestions of the guy above, I somehow managed to fix my problem. I don't think it was the capitalization part that was part of the problem, though. It was more because of my split and a few other problems, like bad placement.

      In answer to your question about why I don't use strict, there is none. I just don't like it. Most of my stuff is casual programming, since I am a mere beginner and I prefer to add it in after I have finished and tested my program.

        "why I don't use strict, there is none. I just don't like it."
        Just do it. In most cases, all it means is that you have to type only 3 extra characters (m-y-space) before you declare a variable. After you start doing this, it becomes muscle memory.

        Don't make me dedicate a poem to you :)

        Most monks "just don't like" helping people who don't use strict. When you ask for advice, and don't take it when it's given, people remember that. Beginners are the ones who generally benefit the most from strict. You are unambiguously making a mistake, and making it harder for people to help you. It's not a matter of taste.

        I would strongly urge you to reconsider your position. In my experience, strict's greatest benefit is catching typos during initial code development, and it will also help make issues like variable scoping clearer in your mind. Give "Use strict warnings and diagnostics or die" a read and see if you don't change your mind. I always use it for any script that's more than a dozen lines.
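As a minimal sketch of the typo-catching benefit: the snippet below is compiled twice with string eval, once with strict switched off and once with it on. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# A tiny piece of code with a typo: the hash is declared as
# %word_counts but used as %word_count (missing "s").
my $snippet = q{
    my %word_counts;
    $word_count{the}++;
    1;
};

# Without strict, the typo silently creates a new global hash.
my $ok_without  = eval "no strict; no warnings; $snippet";
my $err_without = $@;

# With strict, the same code refuses to compile.
my $ok_with  = eval "use strict; $snippet";
my $err_with = $@;

print "without strict: ",
    ($ok_without ? "ran without complaint" : "failed: $err_without"), "\n";
print "with strict:    ",
    ($ok_with ? "ran without complaint" : "failed: $err_with"), "\n";
```

In the first case the count goes into the wrong hash and nothing is reported. In the second, perl names the undeclared variable before anything runs.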
        Re use strict;, you said:
        "since I am a mere beginner and I prefer to add it in after I have finished and tested my program."

        To me, and (I strongly suspect) to many other Monks, this is evidence that you don't understand the purpose or benefits of invoking the pragma.

        Rather than an (add-in||nice-to-have||safety valve), strict exists in large measure to help you get your program "right" whilst finishing and testing (and anything else). If you start coding with strict in use, perl will provide valuable information about any errors anytime you look for them.

        And in case the replies above haven't convinced you... I've been using perl for 15 years; I've been including "use strict" for more than half of that time.

        Adding it to my normal habit has made me a better programmer, because it causes me to think about the constraints it imposes while I am writing code (especially the required scope for variables). And as mentioned above, it also makes it easier and quicker to get my code to work as intended, because it catches mistakes that might otherwise be hard to spot.

        The only time I don't use it is when I am doing spontaneous one-liners at the shell command line, because in that situation, compactness carries greater value, and the code I need is relatively short and simple (requiring few if any variables). But every script that I store as a file has "use strict" in it.