counting number of occurrences of words in a file

derpp has asked for the wisdom of the Perl Monks concerning the following question:

Re: counting number of occurrences of words in a file by ikegami (Patriarch) on Aug 11, 2010 at 18:26 UTC
"The" ne "the". You should normalise the case.
Other issues:
You use `use warnings;`. Great! You should also use `use strict;`, though.
Why loop when you know you're only going to get one value? `undef($/); while (<FILE>) { }` should be `undef($/); $_ = <FILE>;`
Use of alternation inside character class. `s/[\,|\.|\!|\?|\:|\;|\"|\'|\<|\>]//g;` is the same as `s/[\,\.\!\?\:\;\"\'\<\>||||||||]//g;` and `s/[\,\.\!\?\:\;\"\'\<\>|]//g;` You want an alternation `s/\,|\.|\!|\?|\:|\;|\"|\'|\<|\>//g;` or a character class `s/[\,\.\!\?\:\;\"\'\<\>]//g;`
Useless escaping. For readability, `@array = split(/\ /, $_); s/[\,\.\!\?\:\;\"\'\<\>]//g;` should be `@array = split(/ /, $_); s/[,.!?:;"'<>]//g;`
`'<insertfilepath>' || $!` is the same as just `'<insertfilepath>'` since the file name will always be a true value.
Useless use of global variables (`FILE`), and unlocalised changes to global variables (`$/`).
Splitting on spaces won't split on newlines, and will produce empty strings when there are two spaces in a row. Split on special `' '` instead. `@array = split(/ /, $_);` should be `@array = split(' ', $_);`
`use strict; use warnings; my $file; { my $qfn = '<insertfilepath>'; open(my $FILE, '<', $qfn) or die("Can't open \"$qfn\": $!\n"); local $/; $file = <$FILE>; } my %word_counts; for (split(' ', $file)) { s/[,.!?:;"'<>]//g; ++$word_counts{lc($_)}; } for my $word (sort keys(%word_counts)) { print "$word occurred $word_counts{$word} times\n"; }`
Update: There's no reason to load the entire file into memory at once, and if you don't, you gain the ability to pass a file name on the command line.
`use strict; use warnings; my %word_counts; while (<>) { for (split(' ', $_)) { s/[,.!?:;"'<>]//g; ++$word_counts{lc($_)}; } } for my $word (sort keys(%word_counts)) { print "$word occurred $word_counts{$word} times\n"; }`
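Two of the points in the reply above can be checked directly. The following is only a minimal sketch: the sample string and the variable names (`$text`, `@naive`, `@words`, `$piped`, `$stripped`) are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample text: note the double space and the embedded newline.
my $text = "Hello.  How are\nyou?";

# A literal single-space pattern yields an empty field for the doubled
# space and does not treat the newline as a separator.
my @naive = split(/ /, $text);

# The special ' ' argument splits on runs of any whitespace
# and skips leading whitespace.
my @words = split(' ', $text);

# Inside a character class, | is just a literal pipe, not alternation,
# so this substitution removes pipes as well as commas and periods.
my $piped = 'a|b';
(my $stripped = $piped) =~ s/[,|.]//g;

print "split(/ /):  ", join(', ', map { "[$_]" } @naive), "\n";
print "split(' '):  ", join(', ', map { "[$_]" } @words), "\n";
print "after s/[,|.]//g: $stripped\n";
```

The bracketed output makes the empty field and the unsplit `are\nyou?` chunk visible, which is why the word counts come out wrong with `split(/ /, $_)`.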
Re: counting number of occurrences of words in a file by kennethk (Abbot) on Aug 11, 2010 at 18:32 UTC
A couple of critiques of your posted code:
You use warnings but not strict; is there a reason?
Your open test doesn't do what you think. The C style Logical Or (`||`) is higher precedence than the Comma Operator, so as long as your file path is not logically false, it is a null op. In addition, it's inside parentheses. The smallest change that will yield code that functions as you likely expect is `open (FILE, '<insertfilepath>') || die $!;` though I personally would use something closer to `open (my $fh, '<', '<insertfilepath>') or die "Open failed : $!"; undef($/); while (<$fh>) {` See perlopentut.
The default behavior for split with no arguments will do what you intend: it splits $_ on one or more consecutive whitespace characters. Your expression likely does not do what you intend for `Hello.  How are you?` since it creates an empty entry for the double space after the period. I'd swap the line to: `my @array = split;` or at least `my @array = split(/\s+/,$_);`
You never use a scalar named `$word` but you declare one - another no-op. You likely mean `my %word;`. See Perl variable types in perlintro.
Rather than try and define every possible non-word character, you should use character classes. So replace `s/[\,|\.|\!|\?|\:|\;|\"|\'|\<|\>]//g;` with `s/\W//g`. This is not literally identical, but if you are just using English language sources w/o mathematical formulas you are pretty well safe. See perlretut.
You don't account for variations in capitalization - I suspect this is the bug you are encountering. You should lowercase the result to compensate, either with `$_ = lc;` or `tr/A-Z/a-z/;`
You also have a scoping issue with overwriting `@array` that you avoided through luck because you slurp the file and don't enforce strict.
With all these changes, your code might look like: `#!/usr/bin/perl use strict; use warnings; open (my $fh, '<', '<insertfilepath>') or die "Open failed : $!"; undef($/); my %word; while (<$fh>) { my @array = split(/\s+/, $_); foreach (@array) { print "$_\n"; } for (@array){ s/\W//g; tr/A-Z/a-z/; $word{$_}++; } } for (sort(keys %word)) { print "$_ occurred $word{$_} times\n"; }`
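The precedence point about `open` can be seen in a small experiment. This is a sketch under stated assumptions: the missing file is a made-up name inside a fresh temporary directory, and the names `$missing`, `$first` and `$second` are illustrative only.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical path that cannot exist: a file name inside a new, empty temp dir.
my $missing = File::Spec->catfile(tempdir(CLEANUP => 1), 'no_such_file.txt');

# || binds tighter than the comma, so it applies to the path alone.
# The path is a true value, so die is never evaluated and the failed
# open goes unnoticed.
my $opened = open(my $fh1, '<', $missing || die "never reached\n");
my $first  = $opened ? 'opened' : 'failed silently';

# "or" binds more loosely than the whole open(...) call, so it tests
# the return value of open itself.
my $second = 'opened';
unless (eval { open(my $fh2, '<', $missing) or die "Open failed: $!\n"; 1 }) {
    $second = "died: $@";
}

print "with || inside the call: $first\n";
print "with or after the call:  $second";
```

Writing `open(...) || die $!` with the parentheses closed before the `||` behaves like the `or` form; the bug only appears when the `||` ends up inside the argument list.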
Re^2: counting number of occurrences of words in a file by derpp (Acolyte) on Aug 11, 2010 at 19:01 UTC
Thank you! Using some of your suggestions and the suggestions of the guy above, I somehow managed to fix my problem. I don't think it was the capitalization part that was part of the problem, though. It was more because of my split and a few other problems, like bad placement. In answer to your question about why I don't use strict, there is none. I just don't like it. Most of my stuff is casual programming, since I am a mere beginner and I prefer to add it in after I have finished and tested my program.
Re^3: counting number of occurrences of words in a file (use strict) by toolic (Bishop) on Aug 11, 2010 at 19:17 UTC
"why I don't use strict, there is none. I just don't like it."
Just do it. In most cases, all it means is that you have to type only 3 extra characters (m-y-space) before you declare a variable. After you start doing this, it becomes muscle memory. Don't make me dedicate a poem to you :)
Re^3: counting number of occurrences of words in a file by ssandv (Hermit) on Aug 11, 2010 at 19:41 UTC
Most monks "just don't like" helping people who don't `use strict`. When you ask for advice, and don't take it when it's given, people remember that. Beginners are the ones who generally benefit the most from `strict`. You are unambiguously making a mistake, and making it harder for people to help you. It's not a matter of taste.
Re^3: counting number of occurrences of words in a file by kennethk (Abbot) on Aug 11, 2010 at 19:08 UTC
I would strongly urge you to reconsider your position. strict's greatest benefits in my experience come from catching typos during initial code development, and will help make issues like Variable scoping more clear in your mind. Give Use strict warnings and diagnostics or die a read and see if you don't change your mind. I always use it for any script that's more than a dozen lines.
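To make the typo-catching point concrete, here is a minimal sketch. The helper `try_snippet` and the misspelled variable `$cuont` are invented for this illustration; string `eval` is used only so the compile-time error can be captured and printed instead of aborting the script.

```perl
use strict;
use warnings;

# Compile and run a snippet with string eval so that a compile-time
# error can be captured. Returns (value, error message).
sub try_snippet {
    my ($code) = @_;
    my $value = eval $code;
    return ($value, $@);
}

# Hypothetical buggy code: "$cuont" is a typo for "$count".
my $body = 'my $count = 0; $cuont++ for 1 .. 3; $count';

# Without strict, the typo quietly creates a separate global variable.
my ($loose_value, $loose_error) = try_snippet("no strict; no warnings; $body");

# With strict, the same code refuses to compile.
my ($strict_value, $strict_error) = try_snippet("use strict; $body");

print "without strict: count = $loose_value (typo went unnoticed)\n";
print "with strict:    $strict_error";
```

Without `strict` the counter silently stays at 0, which is the kind of wrong word count that is hard to track down; with `strict` the misspelled name is reported before the program runs.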
Re^3: counting number of occurrences of words in a file by ww (Archbishop) on Aug 11, 2010 at 20:20 UTC
Re `use strict;`, you said: "since I am a mere beginner and I prefer to add it in after I have finished and tested my program." To me, and (I strongly suspect) to many other Monks, this is evidence that you don't understand the purpose or benefits of invoking the pragma. Rather than an (add-in || nice-to-have || safety valve), `strict` exists in large measure to help you get your program "right" whilst finishing and testing (and anything else). If you start coding with `strict` in use, perl will provide valuable information about any errors anytime you look for them.
Re^3: counting number of occurrences of words in a file by graff (Chancellor) on Aug 12, 2010 at 02:11 UTC
And in case the replies above haven't convinced you... I've been using perl for 15 years; I've been including "use strict" for more than half of that time. Adding it to my normal habit has made me a better programmer, because it causes me to think about the constraints it imposes while I am writing code (especially the required scope for variables). And as mentioned above, it also makes it easier and quicker to get my code to work as intended, because it catches mistakes that might otherwise be hard to spot. The only time I don't use it is when I am doing spontaneous one-liners at the shell command line, because in that situation, compactness carries greater value, and the code I need is relatively short and simple (requiring few if any variables). But every script that I store as a file has "use strict" in it.