Hello
I used to decide whether a word should be excluded from processing by checking whether it appears in a stop-word list. I used this method:
    my $CkDiscardCommonwords = 1;    # whether to filter out stop words
    my $term = "word";
    my $commonwordsRX = loadCommonWords();

    if ($CkDiscardCommonwords == 1) {
        # anchored match: the whole term must equal one of the stop words
        if ($term =~ /^(?:$commonwordsRX)$/) {
            return (0);
        }
    }

    sub loadCommonWords {
        my @commonwords;
        my $filename = "commonWords.txt";
        # lexical filehandle, so the code also compiles under "use strict"
        if (open my $FH, "<:encoding(UTF-8)", $filename) {
            while (my $line = <$FH>) {
                chomp $line;
                push @commonwords, $line;
            }
            close $FH;
        }
        # escape regex metacharacters, then join the words into one alternation
        my $commonwordsRX = join "|", map quotemeta, @commonwords;
        return $commonwordsRX;
    }
Now my software has changed, and the list of common words saved in commonWords.txt may grow substantially. It used to be small (~300 words); now it could reach several thousand.
I would like to hear what expert monks think about this implementation. Would a regex constructed this way cause problems as the word list grows? Should I choose another approach?
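For comparison, one alternative to the alternation regex would be an exact-match hash lookup. This is only a minimal sketch, assuming the same one-word-per-line commonWords.txt; the names build_common_words_hash and load_common_words are hypothetical and not part of my current code, and the sample word list is made up so the sketch runs without the file:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of stop words into a lookup hash.
sub build_common_words_hash {
    my @words = @_;
    my %is_common = map { $_ => 1 } @words;
    return \%is_common;
}

# Hypothetical helper: read one word per line, as loadCommonWords does.
sub load_common_words {
    my ($filename) = @_;
    my @words;
    if (open my $fh, '<:encoding(UTF-8)', $filename) {
        while (my $line = <$fh>) {
            chomp $line;
            push @words, $line if length $line;   # skip empty lines
        }
        close $fh;
    }
    return @words;
}

# Made-up sample list; in real use this would be
# build_common_words_hash(load_common_words("commonWords.txt")).
my $is_common = build_common_words_hash(qw(the and of));

for my $term (qw(the perl)) {
    # exists() is an exact, case-sensitive match on the whole term
    my $verdict = exists $is_common->{$term} ? "discard" : "keep";
    print "$term: $verdict\n";
}
```

Like the anchored regex, the hash lookup matches the whole term exactly and is case-sensitive, and the cost of a single lookup does not depend on how many words are in the list.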
In reply to Filtering out stop words by IB2017