IB2017 has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks
I need some help in adapting the following Regex:
my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b/, @stopwords;
which I use to remove all stopwords contained in @stopwords from a string by means of $string =~ s/$rx//g;. This works fine except when a word contains a hyphen and one part of the word happens to be a stopword. For example, the French word "sous-alimentation" loses "sous" (since "sous" is in my @stopwords), even though it should be retained as part of a terminological unit. Any idea how I can avoid this? (I thought of using whitespace in my $rx, but I guess that could cause problems at the beginning and end of the string.) Thank you.
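One possible approach, sketched here under assumptions rather than taken from the replies: since \b treats the hyphen as a word boundary, the boundary checks can be replaced with lookarounds that reject a neighbouring word character or hyphen. The helper name remove_stopwords, the sample stopword list, and the whitespace cleanup at the end are all hypothetical additions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): removes stopwords,
# but leaves them alone when they are part of a hyphenated compound.
sub remove_stopwords {
    my ($string, @stopwords) = @_;
    return $string unless @stopwords;

    # Quote each stopword and join them into a single alternation.
    my $alternation = join '|', map { quotemeta } @stopwords;

    # Instead of \b, require that the stopword is neither preceded nor
    # followed by a word character or a hyphen.
    my $rx = qr/(?<![\w-])(?:$alternation)(?![\w-])/;

    (my $cleaned = $string) =~ s/$rx//g;

    # Assumed cleanup step: collapse leftover runs of whitespace and trim.
    $cleaned =~ s/\s+/ /g;
    $cleaned =~ s/^ | $//g;
    return $cleaned;
}

# Sample stopwords are an assumption; the real @stopwords is not shown.
print remove_stopwords('la sous-alimentation de sous', qw(sous la de)), "\n";
# prints "sous-alimentation"
```

The lookarounds also succeed at the very start and end of the string, so no special handling of the string edges is needed, which was the concern with matching literal whitespace.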
Re: Improving regular expression to remove stopwords
by hippo (Archbishop) on Jan 09, 2019 at 12:01 UTC
by IB2017 (Pilgrim) on Jan 09, 2019 at 13:18 UTC

Re: Improving regular expression to remove stopwords
by Veltro (Hermit) on Jan 09, 2019 at 12:06 UTC

Re: Improving regular expression to remove stopwords
by AnomalousMonk (Archbishop) on Jan 09, 2019 at 18:39 UTC