maxamillionk has asked for the wisdom of the Perl Monks concerning the following question:

Brief: Basically I am trying to work on data from a file as a big string, but the diamond <> only reads one line at a time and (I think) that's why it's slow.

Long explanation: I have a bash shell script that reads all the text of an input file into memory, counts every word, and prints each word with the number of times it appears, in descending order. It also filters out common words and deletes empty lines. It takes 4 seconds to finish on large files. I have tried to re-create this in Perl:
#!/usr/bin/perl
# word counting program
use strict;
use warnings;
use autodie;

# list of excluded words
my @excluded = qw(
    a about although also an and another are as at be been before
    between but by can do during for from has how however in into is
    it many may more most etc
);

# list of excluded characters
my @excluded_chars = ( "\\'", "\\:", "\\@", "\\-", "\\~", "\\,", "\\.",
    "\\(", "\\)", "\\?", "\\*", "\\%", "\\/", "\\[", "\\]", "\\=", '"'
);

my %count;    # this will contain many words

while (<>) {
    foreach (split) {
        s/ ([A-Z]) /\L$1/gx;    # lowercase each word
        # remove non-letter characters
        foreach my $char (@excluded_chars) {
            $_ =~ s/$char//g;
        }
        # remove excluded words
        foreach my $word (@excluded) {
            $_ =~ s/\b$word\b//g;
        }
        $count{$_}++;           # count each separate word
    }
}

foreach my $word (sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
    print "$count{$word} $word\n";
}
However, it takes way too long, about 40 to 50 seconds to finish. I am guessing it's because it reads one line at a time from the file. What is a good way to make Perl go faster? (I'm a noob!) For contrast, below is the bash shell script that runs way faster.
#!/bin/bash
# input a file name like this:
#
#   count_mem.sh filename.txt
#
if [ $# -eq 0 ]; then
    echo "example usage: $(basename $0) file.txt" >&2
    exit 1
elif [ $# -ge 2 ]; then
    echo "too many arguments" >&2
    exit 2
fi

sed s/' '/\\n/g "$1" |
tr -d '[\.[]{}(),\!\\'\'''\"'\`\~\@\#\$\%\^\&\*\+\=\|\;\:\<\>\?]' |
tr [:upper:] [:lower:] |
sed "s/\blong\b//gi" |
sed "s/\blist\b//gi" |
sed "s/\bof\b//gi" |
sed "s/\bexcluded\b//gi" |
sed "s/\bwords\b//gi" |
sed "s/\bhere\b//gi" |
sed '/^$/d' |
sort | uniq -c | sort -nr |
less
I apologize if a similar question was asked before - I tried searching and wasn't able to find a thread resembling my own.

Replies are listed 'Best First'.
Re: Counting and Filtering Words From File
by haukex (Archbishop) on May 09, 2020 at 22:07 UTC
    I am guessing it's because it reads one line at a time from the file.

    My guess is different: you've got several nested loops - for each word, you loop over @excluded and @excluded_chars and run a regex for each, so with what you showed here, that's 50 regex runs per word. Instead, see Building Regex Alternations Dynamically. Also, I use lc to lowercase instead of a regex, which handles unicode as well (though you would need to make sure your inputs are properly decoded for that case).

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use autodie;

    my @excluded_words = qw( a about although also an and another are as
        at be been before between but by can do during for from has how
        however in into is it many may more most etc );
    my @excluded_chars = ( "'", ':', '@', '-', '~', ',', '.', '(', ')',
        '?', '*', '%', '/', '[', ']', '=', '"' );

    my ($word_regex) = map {qr/$_/} join '|',
        map {"\\b".quotemeta."\\b"}
        sort { length $b <=> length $a or $a cmp $b } @excluded_words;
    my ($char_regex) = map {qr/$_/} join '|',
        map {quotemeta}
        sort { length $b <=> length $a or $a cmp $b } @excluded_chars;

    my %count;
    while (<>) {
        for (split) {
            $_ = lc;
            s/$char_regex//g;
            s/$word_regex//g;
            $count{$_}++;
        }
    }
    for my $word ( sort { $count{$a} <=> $count{$b} or $a cmp $b }
        keys %count ) {
        print "$count{$word} $word\n";
    }

    On my machine on a test file, this is faster than the original code by a factor of roughly 7x.

    Update: The first regex is written a bit more cleanly as: my ($word_regex) = map {qr/\b(?:$_)\b/} join '|', map {quotemeta} sort ...

      Ok that is cool, I didn't know about 'map' and 'quotemeta'. I think your code runs about the same speed as the shell script.
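      For my own notes, a minimal standalone sketch of what map and quotemeta do together here (the word list is made up):

      #!/usr/bin/env perl
      use strict;
      use warnings;

      my @words = qw( a.b c* plain );                  # made-up word list
      my $alt   = join '|', map { quotemeta } @words;  # escape regex metacharacters, then join with |
      my $regex = qr/\b(?:$alt)\b/;                    # compile the alternation once, reuse many times

      print "matched\n" if 'this has a.b in it' =~ $regex;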
Re: Counting and Filtering Words From File
by tybalt89 (Monsignor) on May 10, 2020 at 00:34 UTC

    Try this. It does the lc() and the tr/// only once.

    #!/usr/bin/perl
    use strict; # https://perlmonks.org/?node_id=11116620
    use warnings;

    my @excluded = qw( a about although also an and another are as at be
        been before between but by can do during for from has how however
        in into is it many may more most etc );

    local $/;
    my %count;
    $count{$_}++ for split ' ', (lc <>) =~ tr!-'@~,.()?*%/[]="!!dr;
    delete @count{@excluded};

    print "$count{$_} $_\n" for
        sort { $count{ $b } <=> $count{ $a } || $a cmp $b } keys %count;

      Independently I have arrived at a similar solution.

      #!/usr/bin/env perl
      use strict;
      use warnings;

      my %excluded = map { $_ => 1 } qw( a about although also an and
          another are as at be been before between but by can do during
          for from has how however in into is it many may more most etc );

      my %count;
      {
          local $/ = "";
          while (<>) {
              tr {A-Z':@~,.()?*%/[]="-}{a-z}d;
              foreach (split) {
                  $count{$_}++ unless $excluded{$_};
              }
          }
      }

      foreach my $word (sort { $count{$a} <=> $count{$b} or $a cmp $b }
          keys %count) {
          print "$count{$word} $word\n";
      }

      I've leveraged the requirement of only lowercasing the ASCII letters by incorporating it into the tr///, and I've gone for paragraph mode instead of a single slurp, just in case :-)
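      For clarity, a minimal standalone sketch of that combined tr/// (the sample string is made up); one pass lowercases A-Z and deletes the listed punctuation:

      #!/usr/bin/env perl
      use strict;
      use warnings;

      my $text = "Hello, (World): it's 100% HERE.";
      $text =~ tr{A-Z':@~,.()?*%/[]="-}{a-z}d;   # translate A-Z to a-z, delete the rest of the search list
      print "$text\n";                           # prints: hello world its 100 here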

      Both solutions run in similar times and about 100x faster than the original code:

      $ time ./11116620.pl < Frankenstein.txt > orig.out

      real    0m9.381s
      user    0m9.366s
      sys     0m0.007s

      $ time ./wordcount.pl < Frankenstein.txt > hippo.out

      real    0m0.089s
      user    0m0.081s
      sys     0m0.008s

      $ time ./tybalt.pl < Frankenstein.txt > tybalt.out

      real    0m0.090s
      user    0m0.084s
      sys     0m0.005s

      There are some minor differences between all three outputs but without a tighter spec these aren't overly concerning.

Re: Counting and Filtering Words From File
by jwkrahn (Abbot) on May 09, 2020 at 22:49 UTC

    I don't understand your "list of excluded characters". In the bash version you are using tr, which only works on characters, but in the Perl version your list uses two-character strings instead of single characters. Also, Perl has a tr/// operator which works pretty much the same as the command-line tool (on single characters).
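    For example, a minimal sketch of Perl's tr/// deleting single characters in one pass, much like tr -d on the command line (the sample string is made up):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $text = "hello, (world): it's 100% here.";
    (my $clean = $text) =~ tr{':@~,.()?*%/[]="}{}d;   # delete each listed character
    print "$clean\n";                                 # prints: hello world its 100 here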

    Your "list of excluded words" would probably be better as a hash.

    Reading a file a line at a time shouldn't be a problem, as Perl's IO is buffered. However, you can change how much data is read at a time by changing the Input Record Separator ($/), for example to read in a whole file at once.
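    For instance, a minimal sketch of slurping via $/ (it simply counts words in whatever input it is given):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $text = do {
        local $/;    # undef the input record separator inside this block only
        <>;          # one read now returns the entire input
    };
    my @words = split ' ', $text;
    print scalar(@words), " words\n";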

    The comment in your program says "# remove non-letter characters", but you are only removing certain punctuation characters.

    Perhaps you want something like this:

    #!/usr/bin/perl
    # word counting program
    use strict;
    use warnings;
    use autodie;

    # list of excluded words
    my %excluded = map { $_ => 1 } qw( a about although also an and
        another are as at be been before between but by can do during for
        from has how however in into is it many may more most etc );

    ## list of excluded characters
    #my @excluded_chars = ( "\\'", "\\:", "\\@", "\\-", "\\~", "\\,", "\\.",
    #    "\\(", "\\)", "\\?", "\\*", "\\%", "\\/", "\\[", "\\]", "\\=", '"'
    #    );

    # Read in whole file (uncomment next line)
    # local $/;

    my %count;    # this will contain many words

    while (<>) {
        # remove punctuation
        tr{':@\-~,.()*%/[]="}{}d;
        foreach ( split ) {
            # remove excluded words
            next if exists $excluded{ lc() };
            ++$count{ lc() };    # count each separate word
        }
    }

    foreach my $word (sort { $count{ $a } <=> $count{ $b } || $a cmp $b }
        keys %count) {
        print "$count{$word} $word\n";
    }
Re: Counting and Filtering Words From File
by perlfan (Parson) on May 12, 2020 at 02:33 UTC
    File::Slurp or Path::Tiny may be your friends. Beware, your system's memory may not like you if the file is too big.

    Beyond that, the suggestions below about improving your algorithm and utilizing Perl builtins seem to suffice.
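    For illustration, a minimal sketch of the Path::Tiny route (the file name is made up; File::Slurp's read_file works in much the same way):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Path::Tiny;

    my $text = path('input.txt')->slurp_utf8;   # whole file as one string
    my %count;
    $count{ lc $_ }++ for split ' ', $text;
    print "$count{$_} $_\n" for sort { $count{$b} <=> $count{$a} } keys %count;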