in reply to Count number of occurrences of a list of words in a file
Hello Azaghal,
The bottleneck occurs here, in the inner loop:
    while (my $line = <$fh>) {
        chomp $line;
        foreach my $mot (keys (%count)) {
            chomp $mot;
            foreach my $str ($line =~ /$mot/g) {
                $count{$str}++;
            }
        }
    }
If %count contains 60,000 entries, then the foreach loop performs 60,000 regex tests against each line of the input text file! Fortunately, this is quite unnecessary. I would split each line into words and simply look up these words in the hash, like this (untested):
    while (my $line = <$fh>) {
        chomp $line;
        my @words = split /\W+/, $line;
        for my $word (@words) {
            ++$count{$word} if exists $count{$word};
        }
    }
(You may need to tweak the split regex, depending on the contents of the words in the list file.)
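For what it's worth, here is a slightly fuller sketch of the same idea that can be run on its own. The sub name count_listed_words, the sample text, and the word list are all made up for illustration, and an in-memory filehandle stands in for your real input file. Note that, unlike the regex version, this counts whole words only (so "category" does not count as "cat") and is case-sensitive:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): count whole-word
# occurrences of the listed words in the text read from $fh.
# Returns a hash reference mapping word => count.
sub count_listed_words {
    my ($fh, @wanted) = @_;

    # Seed the hash so that only listed words are ever counted.
    my %count = map { $_ => 0 } @wanted;

    while (my $line = <$fh>) {
        chomp $line;
        # One hash lookup per word on the line, regardless of list size.
        for my $word (split /\W+/, $line) {
            ++$count{$word} if exists $count{$word};
        }
    }
    return \%count;
}

# Sample data (an assumption): an in-memory filehandle replaces the real file.
my $text = "the cat sat on the mat\nThe cat, the dog; no category here\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $count = count_listed_words($fh, qw(the cat dog));
close $fh;

printf "%-4s %d\n", $_, $count->{$_} for sort keys %$count;
```

The work per line now depends on the number of words in that line rather than on the size of the list. If you need case-insensitive counting, one option is to lowercase both the list entries and each word with lc before the lookup.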
Hope that helps,
Athanasius <°(((>< contra mundum | Iustus alius egestas vitae, eros Piratica,
Re^2: Count number of occurrences of a list of words in a file
by Tux (Canon) on May 09, 2018 at 16:31 UTC