I presume you are involved in some process that removes punctuation as well. If you're not, then removal of just the stopwords will leave behind some odd patterns (e.g. if input includes things like "about-face", "morning-after pill", "man-about-town", and so on).

    open( LIST, "mystopwords.txt" ) or die "$!";
    my @stopwords = <LIST>;  # assuming one stop word per line
    close LIST;
    chomp @stopwords;
    my $stopregex = join '|', @stopwords;

    # ... now, when you go to delete stopwords from $_,
    # it goes like this:
    s/\b(?:$stopregex)\b//g;
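To make the hyphen problem concrete, here is a small self-contained sketch. The inline stop list and the sample sentence are made up for illustration, and the quotemeta call is an extra precaution of mine (in case a stop word contains regex metacharacters), not something the original list necessarily needs.

```perl
use strict;
use warnings;

# Hypothetical stop list, inlined so the sketch runs without a file.
my @stopwords = qw(about the after);

# quotemeta is an added precaution: it escapes any regex metacharacters.
my $stopregex = join '|', map { quotemeta($_) } @stopwords;

my $text = "the about-face came after lunch";

# Copy $text, then delete every stop word from the copy.
(my $stripped = $text) =~ s/\b(?:$stopregex)\b//g;

print "[$stripped]\n";   # the "-face" fragment is left behind
```

Because \b matches between "t" and "-", the "about" in "about-face" is treated as a whole word and removed, which is why the dangling "-face" survives.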
(You didn't mention whether you were using the "g" modifier when removing the stop words. Could that have been your problem?)
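In case it helps to see the difference, this sketch runs the same substitution with and without "g" on a made-up line; the stop list here is hypothetical.

```perl
use strict;
use warnings;

my $stopregex = join '|', qw(a of the);   # hypothetical stop list

my $line = "the end of the road";

# Without /g, only the first stop word on the line is removed.
(my $once = $line) =~ s/\b(?:$stopregex)\b//;

# With /g, every stop word on the line is removed.
(my $all = $line) =~ s/\b(?:$stopregex)\b//g;

print "[$once]\n[$all]\n";
```

Without the modifier, the second "the" and the "of" are still there after the substitution.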
Another update: the regex approach works fine and might even be optimal, but there's another way, of course:
    my %stopwd;
    open( LIST, "my_stopwords.txt" ) or die "$!";
    while (<LIST>) {
        chomp;
        $stopwd{$_} = undef;  # assume one word per line
    }
    close LIST;

    # now, to remove stopwords, split the input data ($_) on \b
    # and check each token:
    my $filtered = join '', map { exists($stopwd{$_}) ? '':$_ } split /\b/;
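Here is a runnable sketch of the same hash lookup, with a hypothetical inline stop list standing in for the file, just to show what split /\b/ actually hands to the map.

```perl
use strict;
use warnings;

# Hypothetical stop list; a real one would be read from a file.
my %stopwd = map { $_ => undef } qw(the of a);

my $input = "the end of a road";

# split /\b/ returns the words and the runs of non-word characters
# between them as separate tokens, so nothing is lost on the rejoin.
my @tokens = split /\b/, $input;

# Replace each stop-word token with the empty string, keep the rest.
my $filtered = join '', map { exists($stopwd{$_}) ? '' : $_ } @tokens;

print scalar(@tokens), " tokens\n";
print "[$filtered]\n";
```

The whitespace tokens pass through untouched, so runs of spaces are left where stop words used to be, just as with the regex version.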
In reply to Re: removing stop words
by graff
in thread removing stop words
by zulqernain