in reply to Counting and Filtering Words From File
> I am guessing it's because it reads one line at a time from the file.
My guess is different: you've got several nested loops. For each word, you loop over @excluded and @excluded_chars and run a regex for each; with what you showed here, that's 50 regex runs per word. Instead, see Building Regex Alternations Dynamically. Also, I use lc to lowercase instead of a regex, which handles Unicode as well (though you would need to make sure your inputs are properly decoded for that case).
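To illustrate that Unicode point, a tiny standalone sketch (mine, not part of the program below): lc lowercases non-ASCII letters as long as the string holds decoded characters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                                # the literals below are decoded characters
binmode STDOUT, ':encoding(UTF-8)';

# A regex/tr approach like tr/A-Z/a-z/ only touches ASCII letters;
# lc is Unicode-aware once the input has been decoded to characters.
my $word = 'ÉCOLE';
print lc($word), "\n";                   # prints "école"
```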
```perl
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;

my @excluded_words = qw(
    a about although also an and another are as at be been before
    between but by can do during for from has how however in into
    is it many may more most etc
);
my @excluded_chars = (
    "'", ':', '@', '-', '~', ',', '.', '(', ')', '?',
    '*', '%', '/', '[', ']', '=', '"',
);

my ($word_regex) = map {qr/$_/} join '|',
    map {"\\b".quotemeta."\\b"}
    sort { length $b <=> length $a or $a cmp $b } @excluded_words;
my ($char_regex) = map {qr/$_/} join '|',
    map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } @excluded_chars;

my %count;
while (<>) {
    for (split) {
        $_ = lc;
        s/$char_regex//g;
        s/$word_regex//g;
        $count{$_}++;
    }
}

for my $word ( sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count ) {
    print "$count{$word} $word\n";
}
```
On my machine, with a test file, this runs roughly 7x faster than the original code.
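For anyone who wants to reproduce that kind of comparison, here is a rough sketch using the core Benchmark module (the word list, text, and iteration count are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up word list; the point is one compiled alternation versus a
# loop running one regex per word, as discussed above.
my @words   = qw( a an and are as at be by for from in is it of on );
my @regexes = map { qr/\b\Q$_\E\b/ } @words;
my ($alt)   = map { qr/$_/ } join '|',
              map { "\\b" . quotemeta . "\\b" }
              sort { length $b <=> length $a or $a cmp $b } @words;

my $text = "it was a dark and stormy night on an island " x 100;

cmpthese( 500, {
    one_regex_per_word => sub { my $t = $text; $t =~ s/$_//g for @regexes; },
    single_alternation => sub { my $t = $text; $t =~ s/$alt//g; },
} );
```

Both approaches remove the same words; the difference is how many times the regex engine has to start up per input word.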
Update: The first regex is written a bit more cleanly as:

```perl
my ($word_regex) = map {qr/\b(?:$_)\b/} join '|',
    map {quotemeta}
    sort ...
```
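Spelled out with a small word list of my own (not from the update), the cleaner form looks like this — one pair of \b anchors around a non-capturing group instead of a pair per word:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sorting longest-first keeps e.g. "another" ahead of "an" in the
# alternation; the (?:...) group shares a single pair of \b anchors.
my @excluded = qw( an and another );
my ($word_regex) = map { qr/\b(?:$_)\b/ } join '|',
                   map { quotemeta }
                   sort { length $b <=> length $a or $a cmp $b } @excluded;

(my $s = 'another answer and an apple') =~ s/$word_regex//g;
print "$s\n";   # "answer" and "apple" survive; the excluded words are gone
```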
Re^2: Counting and Filtering Words From File
by maxamillionk (Acolyte) on May 10, 2020 at 00:55 UTC