in reply to How much can this text processing be optimized?
Well, several things could be improved.
There is no need to remove the words that you don't want from the string, just include the words that you do want.
If you are going to use precompiled regexes, I think you will get better performance by moving the regex outside of the loop.
As was previously mentioned, accumulating the file into a scalar and then acting on the scalar is going to give you a large performance hit.
If I was going to do something like this, (and I have in the past,) I would probably write it like this:
###########################################################
#! /usr/bin/perl
use warnings;
use strict;

my $word = qr/\b[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+\b/;
my %mywordcount;

while (<STDIN>) {
    my $line = $_;
    while ($line =~ /($word)/g){
        $mywordcount{(lc $1)}++;
    }
}

print "Word\t\t\tFrequency\n";
print "======\t\t\t===========\n";

#sorting alphabetically
print "$_\t\t", (length($_) > 7) ? '' : "\t", $mywordcount{$_}, "\n" for sort keys %mywordcount;
Update: Changed to STDIN fh, I tested with default and forgot to change before posting.
FWIW, on my system, this takes about 2 seconds to process a 1.6 MB file.
Also, it isn't specified in the original post, but if you want to allow for words with an internal apostrophe, (don't, you'll, you're, etc.,) change the line
while ($line =~ /($word)/g){
to
while ($line =~ /($word('$word)?)/g){
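To see what the apostrophe variant picks up, here is a minimal sketch. It assumes an ASCII-only word pattern to keep the sample short, wraps the loop in a hypothetical helper named count_words, and uses a non-capturing group (?:...) for the optional suffix, which is a small variation on the line above rather than a required change.

```perl
use strict;
use warnings;

# ASCII-only word pattern, simplified for illustration (an assumption;
# the script in the post also lists Turkish letters).
my $word = qr/\b[a-zA-Z]+\b/;

# Hypothetical helper: count the words in one line, treating an internal
# apostrophe plus the letters after it as part of the same word.
sub count_words {
    my ($line, $counts) = @_;
    while ($line =~ /($word(?:'$word)?)/g) {
        $counts->{lc $1}++;
    }
    return $counts;
}

my %counts;
count_words("Don't stop; you'll see, don't you?", \%counts);

# "don't" is counted twice; "you" and "you'll" are kept as separate words.
print "$_\t$counts{$_}\n" for sort keys %counts;
```

Without the optional suffix the same input would be split into "don" and "t", "you" and "ll".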
Update 2: Sigh. I realized that my regex wouldn't work correctly with words containing non-ASCII characters. (\b doesn't work for multi-byte characters.)
Should be:
my $word = qr/(?<!\p{Alpha})[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+(?!\p{Alpha})/;
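A pattern like this only behaves as intended when Perl is matching decoded characters rather than raw bytes. The sketch below rests on two assumptions that the post does not state: the script file is saved as UTF-8 (hence use utf8, so the literal letters in the class are read as single characters), and the text is UTF-8 as well. The helper name words_in is hypothetical. Note that the class as posted has no uppercase Ç, so a word starting with it would be skipped unless that letter is added.

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Lookaround version of the word pattern, copied from the update.
my $word = qr/(?<!\p{Alpha})[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+(?!\p{Alpha})/;

# Hypothetical helper: return every word found in an already-decoded line.
sub words_in {
    my ($line) = @_;
    return $line =~ /($word)/g;
}

# Encode output as UTF-8 so the non-ASCII letters print cleanly.
binmode(STDOUT, ':encoding(UTF-8)');

print join(', ', words_in("güzel ışık, şeker")), "\n";
```

When reading from STDIN instead of a literal string, the input would need decoding too, for example with binmode(STDIN, ':encoding(UTF-8)') before the while loop, again assuming the file is UTF-8.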