in reply to Counting and Filtering Words From File
> I am guessing it's because it reads one line at a time from the file.
My guess is different: you've got several nested loops. For each word, you loop over @excluded and @excluded_chars and run a regex for each; with what you showed here, that's 50 regex runs per word. Instead, see Building Regex Alternations Dynamically. Also, I use lc to lowercase instead of a regex, which handles Unicode as well (though you would need to make sure your inputs are properly decoded for that case).
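To illustrate that Unicode point, a tiny standalone sketch (mine, not part of the program below): lc lowercases non-ASCII letters as long as the string holds decoded characters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                                # the literals below are decoded characters
binmode STDOUT, ':encoding(UTF-8)';

# A regex/tr approach like tr/A-Z/a-z/ only touches ASCII letters;
# lc is Unicode-aware once the input has been decoded to characters.
my $word = 'ÉCOLE';
print lc($word), "\n";                   # prints "école"
```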
```perl
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;

my @excluded_words = qw(
    a about although also an and another are as at be been before
    between but by can do during for from has how however in into
    is it many may more most etc
);
my @excluded_chars = (
    "'", ':', '@', '-', '~', ',', '.', '(', ')', '?',
    '*', '%', '/', '[', ']', '=', '"',
);

my ($word_regex) = map {qr/$_/} join '|',
    map {"\\b".quotemeta."\\b"}
    sort { length $b <=> length $a or $a cmp $b } @excluded_words;
my ($char_regex) = map {qr/$_/} join '|',
    map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } @excluded_chars;

my %count;
while (<>) {
    for (split) {
        $_ = lc;
        s/$char_regex//g;
        s/$word_regex//g;
        $count{$_}++;
    }
}

for my $word ( sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count ) {
    print "$count{$word} $word\n";
}
```
On my machine, with a test file, this runs roughly 7x faster than the original code.
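For anyone who wants to reproduce that kind of comparison, here is a rough sketch using the core Benchmark module (the word list, text, and iteration count are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up word list; the point is one compiled alternation versus a
# loop running one regex per word, as discussed above.
my @words   = qw( a an and are as at be by for from in is it of on );
my @regexes = map { qr/\b\Q$_\E\b/ } @words;
my ($alt)   = map { qr/$_/ } join '|',
              map { "\\b" . quotemeta . "\\b" }
              sort { length $b <=> length $a or $a cmp $b } @words;

my $text = "it was a dark and stormy night on an island " x 100;

cmpthese( 500, {
    one_regex_per_word => sub { my $t = $text; $t =~ s/$_//g for @regexes; },
    single_alternation => sub { my $t = $text; $t =~ s/$alt//g; },
} );
```

Both approaches remove the same words; the difference is how many times the regex engine has to start up per input word.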
Update: The first regex is written a bit more cleanly as:

```perl
my ($word_regex) = map {qr/\b(?:$_)\b/} join '|',
    map {quotemeta}
    sort ...
```
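Spelled out with a small word list of my own (not from the update), the cleaner form looks like this — one pair of \b anchors around a non-capturing group instead of a pair per word:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sorting longest-first keeps e.g. "another" ahead of "an" in the
# alternation; the (?:...) group shares a single pair of \b anchors.
my @excluded = qw( an and another );
my ($word_regex) = map { qr/\b(?:$_)\b/ } join '|',
                   map { quotemeta }
                   sort { length $b <=> length $a or $a cmp $b } @excluded;

(my $s = 'another answer and an apple') =~ s/$word_regex//g;
print "$s\n";   # "answer" and "apple" survive; the excluded words are gone
```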
Re^2: Counting and Filtering Words From File
by maxamillionk (Acolyte) on May 10, 2020 at 00:55 UTC